Abstract


A processing scheme for speech signals is proposed that emulates synchrony
capture in the auditory nerve. Stimulus-locked spike timing is important for
the representation of stimulus periodicity, low frequency spectrum, and spatial
location. In synchrony capture, dominant single frequency components in each
frequency region impress their time structures on temporal firing patterns of
auditory nerve fibers (ANFs) with nearby characteristic frequencies (CFs). At
low frequencies, for voiced sounds, synchrony capture divides the nerve into
discrete CF territories associated with individual harmonics. An adaptive,
synchrony capture filterbank (SCFB) consisting of a fixed array of traditional,
passive linear (gammatone) filters cascaded with a bank of adaptively tunable,
bandpass filter triplets is proposed. Differences in triplet output envelopes
steer triplet center frequencies via voltage controlled oscillators (VCOs).
The SCFB exhibits some cochlea-like responses, such as two-tone suppression and
distortion products, and possesses many desirable properties for processing
speech, music, and natural sounds. Strong signal components dominate relatively
greater numbers of filter channels, thereby yielding robust encodings of
relative component intensities. The VCOs precisely lock onto harmonics most
important for formant tracking, pitch perception, and sound separation.

PACS  numbers:  43.72  Ar,  43.64  Bt,  43.64  Sj 



Synchrony  capture  filterbank  (SCFB):  Auditory-inspired  signal  processing  for 
tracking  individual  components  in  speech 


I.  INTRODUCTION 

For the past three decades there has been significant interest in developing computational
signal processing models based on the physiology of the cochlea and auditory nerve (AN)1.
The hope has been that artificial systems can be designed and built using signal processing
strategies gleaned from nature that can equal or exceed human auditory performance.
Our work in this area is motivated by neurophysiological observations of the synchrony
capture phenomenon in the auditory nerve that were originally reported by Sachs et al.2 and
Delgutte et al.3. This paper proposes such a biologically-inspired signal processing strategy
for processing speech and audio signals.

If one systematically examines the temporal representation of low harmonics of complex
sounds in the auditory nerve, synchrony capture is a striking feature. Synchrony capture
means that the dominant component in a given frequency band preferentially drives auditory
nerve fibers innervating the entire corresponding frequency region of the cochlea3. Here,
virtually all fibers innervating this cochlear place region, i.e. those with CFs in the vicinity
of the frequency of the dominant component, synchronize exclusively to the dominant
component, in spite of the presence of other nearby weaker components that may be closer to
their CFs. At moderate and high sound pressure levels, fibers spanning an entire octave or
more of CF are typically driven at their maximal rates and exhibit firing patterns related
to a single, dominant component in each formant region. Because of the asymmetric nature
of cochlear tuning, this dominant component mostly drives fibers whose CFs lie above it in
frequency. Figures 1 and 2 provide examples of this phenomenon in slightly different forms.
Figure 1a shows peristimulus time histograms (PSTHs) for a five-formant synthetic vowel
sound. Sharp boundaries characteristic of synchrony capture are seen between the different
CF regions driven by different dominant, formant-region harmonics of the multi-formant
vowel. Note that in Figure 1a other non-dominant harmonics in the vowel formant regions
are not explicitly represented.

FIG. 1. Two views of the representation of vowel-like sounds in the AN. (a) Peristimulus
time histograms for cat ANFs arranged by characteristic frequency in response to the onset
of a five-formant synthetic vowel (/da/), reprinted from Secker-Walker and Searle (1990)4.
(b) Distribution of synchronized rates in ANFs in response to a standard vowel /da/ with
three formants F1, F2, and F3. F0 = 100 Hz. Reprinted from Sachs et al. (2002)5.


Figure 1b summarizes temporal firing patterns observed in the cat auditory nerve in
response to a three-formant synthetic vowel5. Relative synchronized rates of fibers to
different component frequencies are shown as a function of fiber characteristic (CF) or best
frequency (BF). Sizes of squares indicate synchronized rates (larger squares = higher rates).
The diagonal gray band shows regions where temporal firing periodicities match fiber BFs,
and the dark horizontal swaths indicate capture of fibers over a range of fiber best
frequencies by individual stimulus components. The most prominent swaths are the synchrony
capture regions for the dominant harmonics associated with each of the three formants
(enclosed boxes). In addition to capture by dominant harmonics in formant regions, low-CF
fibers show synchrony to less-intense, non-formant, low harmonics (n = 1-3) when frequencies
of those harmonics happen to be near their respective CFs (dark boxes within the gray
diagonal band).


[Figure 2 panel title: (a) ΔF0 = 6.6%, F0s = 440 & 469 Hz]

FIG. 2. Synchrony capture of adjacent partials for two frequency separations. The two
neurograms show all-order interspike interval distributions for individual cat auditory nerve
fibers as a function of CF in response to complex tone dyads presented 100 times at 60
dB SPL. Each tone of the pair consisted of equal amplitude harmonics 1-6. New analysis
of dataset originally reported in Tramo et al. (2001)6. (a) Responses to a tone dyad a
musical minor second apart (16:15, ΔF0 = 6.6%). Vertical bars indicate CF regions where
one interspike interval pattern predominates. The CFs of the fibers shown
are: 153, 283, 309, 345, 350, 355, 369, 402, 402, 431, 451, 530, 588, 602, 631, 660, 724,
and 732 Hz. Misordered interval patterns (single-asterisked histograms) are likely due to
small CF measurement errors. (b) Response to a tone dyad a musical fourth apart (4:3,
ΔF0 = 33.3%). Three distinct interspike interval patterns associated with individual partials
(440, 587, and 880 Hz) are produced in different CF bands, with abrupt transitions between
response modes. One fiber shows locking to distortion product 2f1 - f2 near its CF
(double-asterisked histogram, 2f1 - f2 = 293 Hz, CF = 283 Hz). Fiber CFs were 153, 283, 346, 350,
355, 369, 402, 402, 431, 451, 530, 588, 602, 631, 660, 662, 724, 732, and 732 Hz.

Synchrony capture is most directly apparent when distributions of all-order interspike
intervals (spike autocorrelation histograms) produced by individual fibers are plotted as a
function of fiber CF (cochlear place)7. Figure 2 shows fiber interspike interval patterns in
response to two concurrent complex harmonic tones (n = 1-6). For a stimulus in which pairs
of harmonics are close together (Figure 2a, ΔF0 = 6.6% of F0), all of the fibers in the region
synchronize to the composite, modulated waveform. In this case, the temporal firing patterns
in the whole CF region follow the beating of the adjacent partials, producing low-frequency
fluctuations in firing rate that are associated with perceived roughness6. Here, when the
adjacent partials are sufficiently close together, there are no separate temporal, interspike
interval representations of the individual harmonics themselves. On the other hand, for a tone
pair for which the lower harmonics are relatively well separated in frequency (Figure 2b,
ΔF0 = 33.3% of F0), different CF regions are captured by one or another partial. Thus each
harmonic component drives a discrete region of the cochlea in which its temporal pattern
dominates, with almost no zones of beating (right panel: different CF zones show
different interval peak patterns). The result is that each individual partial has its own swath
of auditory nerve fibers that produce corresponding interspike interval patterns.
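
The stimulus values quoted for Figure 2 can be checked with a few lines of arithmetic. The sketch below assumes just-intonation ratios (16:15 and 4:3) above a 440 Hz lower tone, as the figure caption indicates; the function and variable names are illustrative only.

```python
from fractions import Fraction

def harmonics(f0, n_harmonics=6):
    """Frequencies (Hz) of harmonics 1..n_harmonics of a complex tone."""
    return [n * f0 for n in range(1, n_harmonics + 1)]

def upper_f0(f0_low, ratio):
    """F0 of the upper tone for a given frequency ratio (assumed just intonation)."""
    return f0_low * ratio.numerator / ratio.denominator

F0_LOW = 440.0
f0_minor_second = upper_f0(F0_LOW, Fraction(16, 15))  # about 469.3 Hz
f0_fourth = upper_f0(F0_LOW, Fraction(4, 3))          # about 586.7 Hz

# Fractional F0 separations (about 6.7% and 33.3%)
delta_minor_second = (f0_minor_second - F0_LOW) / F0_LOW
delta_fourth = (f0_fourth - F0_LOW) / F0_LOW

# Beat rate between the two fundamentals of the minor-second dyad (Hz)
beat_rate = f0_minor_second - F0_LOW

# Cubic distortion product 2*f1 - f2 for the lowest partials of the fourth
distortion_product = 2 * F0_LOW - f0_fourth

# All partials of the fourth dyad, lowest first
partials_fourth = sorted(harmonics(F0_LOW) + harmonics(f0_fourth))
```

The three lowest partials of the fourth dyad round to 440, 587, and 880 Hz, and the distortion product rounds to 293 Hz, in agreement with the caption.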

The foregoing examples indicate that auditory nerve fibers synchronize preferentially to
dominant components in the signal. In signal processing terms the peripheral auditory
system appears to treat these dominant components as “carrier” frequencies. The effects of the
weaker surrounding components (other harmonics) then manifest themselves as modulations
on these carriers (as can be seen in Figure 1a).


A.  Significance  of  synchrony  capture 

Synchrony capture may have implications for neural representations of periodicity and
spectrum, as well as for F0-based sound separation and grouping. Synchrony capture in the
auditory nerve permits representation of relative intensity that is level-invariant, and thus
is useful for representing the normalized power spectrum in a robust manner. The numbers
of fibers locking onto particular frequency components give indications of the relative
intensities of the corresponding components. This is a robust means of encoding their
relative magnitudes using neural elements with limited dynamic ranges. The proposed SCFB
algorithm8 attempts to emulate this behavior using adaptive filters to create a competition
for channels amongst frequency components that not only accurately reflects their relative
magnitudes, but is also invariant with respect to absolute signal amplitude.

This signal processing strategy for encoding relative intensities has relevance for
auditory nerve representations. Global temporal representations of lower-frequency sounds in the
auditory nerve, called population-interval distributions or summary autocorrelations,
implicitly utilize such principles to represent pitch and timbre (e.g. vowel formant structure)7,9-11.
The most direct signal processing analogues of these global temporal auditory nerve models
are the ensemble interval histograms (EIHs)12. Essentially, dominant frequency components
below 5 kHz that are present at any given instant partition the cochlear CF territory into
swaths of auditory nerve fibers (ANFs) that have similar temporal discharge patterns (and
hence similar interval distributions). In the context of global population-interval
representations that sum together interspike intervals across the entire auditory nerve, relative
intensities of partials are conveyed through relative numbers of all-order interspike intervals
associated with their respective locally-dominant components rather than numbers of CF
channels recruited. Whether through relative numbers of pooled intervals or of
similarly-responding channels, this parcellation of the cochlea into competing synchronization zones
efficiently utilizes the entire auditory nerve for signal representation.

Synchrony  capture  could  also  potentially  be  utilized  by  place-based  brainstem  auditory 
representations  that  analyze  excitation  boundaries  by  using  local  across-CF  comparisons 
of  temporal  firing  patterns13.  Here  the  abrupt  temporal  pattern  discontinuities  associated 
with  synchrony  capture  increase  contrast  and  the  precision  of  boundary  estimations  in  such 
coding  schemes. 

Further, synchrony capture may facilitate F0-pitch formation and sound separation by
enhancing temporal representations of individual, resolved harmonics at the expense of those
produced by interactions of multiple, unresolved harmonics. Synchrony capture has the
effect of minimizing periodicities related to beatings of adjacent harmonics, as can be seen
in the lack of composite interspike interval patterns when the harmonics are well separated
(Figure 2b). The temporal auditory nerve representation of a harmonic complex with low,
well-separated harmonics thus resembles a series of interspike interval patterns each of which
resembles that of a pure tone of corresponding frequency.

The enhancement of the representation of individual harmonics in turn has implications
for F0-based sound separation. Most acoustic signals in everyday life are mixtures of sounds
from multiple sources. In order to separate multiple concurrent sounds, human listeners
mainly rely on differences in onset times and fundamental frequencies (F0s). Results of
psychophysical experiments suggest that separation of multiple auditory objects with different
fundamentals, such as those produced by multiple voices or musical instruments, crucially
depends on the presence of perceptually-resolved harmonics (n<5)14. These resolved
harmonics dominate in pitch perception and have high pitch salience15.

In terms of interspike interval representations of individual partials (as seen in Figure
2), the effect of synchrony capture is to separate the interspike interval patterns of adjacent
partials if they are separated by more than some threshold ratio, or to fuse them together
if they are not. It is therefore not unreasonable to hypothesize that the synchrony
capture process might play a role in whether adjacent partials are fused together or separated
perceptually. For frequencies for which there is significant phase-locking, synchrony
capture behavior thus qualitatively parallels tonal separations and fusions that are associated
with harmonic resolution and critical bands. These parallels notwithstanding, the size of
psychophysically-measured critical bandwidths in cats, roughly twice those of humans, casts
some doubt on a simple, direct correspondence16.

The mechanism in the auditory pathway whereby the harmonically-related components
of each of two concurrent harmonic complexes fuse together to produce two F0-pitches at
their respective fundamentals is not yet understood. The two F0-pitches can be heard out,
even if the harmonics of the two complexes are interleaved, provided that the unrelated,
adjacent harmonics are sufficiently separated in frequency. In this context, synchrony capture
minimizes temporal patterns associated with interactions between adjacent,
harmonically-unrelated partials, thus eliminating interaction products that might otherwise degrade the
representations of the individual harmonics and hinder their grouping and separation on the
basis of shared interspike intervals.

For  the  above  reasons,  it  seems  reasonable  to  emulate  synchrony  capture  in  a  signal 
processing  algorithm. 


B.  Design  rationale  for  the  SCFB  algorithm 

Although  the  explicit  goal  of  the  SCFB  is  to  emulate  synchrony  capture  in  the  auditory 
nerve  and  not  to  model  cochlear  biophysics,  because  its  signal  processing  design  was  partially 
inspired  by  cochlear  structure,  some  discussion  of  the  latter  is  useful  in  understanding  the 
former.  A  schematic  of  the  proposed  SCFB  algorithm  is  shown  in  Figure  3a.  It  consists  of  a 
bank  of  K  fixed,  relatively  broad  filters  in  cascade  with  tunable,  narrower  filters  that  produce 
the  synchrony  capture  behavior.  This  nesting  of  broad  and  narrow  filters  is  not  unlike 
coarse  and  fine  gradations  in  a  vernier  scale.  Tuning  of  the  adaptive  filters  is  carried  out  via 
frequency  discriminator  loops  (FDLs)  on  time  scales  of  milliseconds  to  tens  of  milliseconds, 
making  real-time  frequency  tracking  possible. 

In any attempt to reverse-engineer biological auditory functions, it is useful to
consider artificial systems that exhibit behaviors not unlike their natural counterparts. The
phenomenon of synchrony capture appears similar to the well known “frequency capture”
behavior of traditional FM receivers such as FM discriminators and phase lock loops.
Frequency capture17 occurs when an FM receiver locks on to a strong FM signal even in the
presence of other interfering, relatively weaker FM signals. One such FM receiver circuit
is a frequency discriminator18 (p. 206), which uses stagger-tuned bandpass filters whose
output envelopes are differenced to obtain the demodulated baseband signal. Such circuits are
known to exhibit frequency capture. The signal processing architecture proposed here was

FIG. 3. Synchrony capture filterbank (SCFB). (a) The filterbank architecture consists of K
constant-Q gammatone filters whose logarithmically-spaced center frequencies span the
desired audible frequency range. Each filterbank channel consists of one of the K gammatone
filters cascaded with a frequency discriminator loop (FDL). The output of each channel,
y_c(t), is obtained from its center filter. See sections II and III for details. (b, c) Frequency
responses of fixed and tunable filters in the SCFB. Bottom left panel (b) shows the frequency
responses of fixed gammatone filters (the black dots indicate that not all filter responses are
shown). Bottom right panel (c) shows the frequency responses of the tunable bandpass filter
(BPF) triplets that adapt to the incoming signal. One BPF triplet is associated with each
fixed filter, such that coarse filtering of the fixed gammatone filters is followed by additional,
finer filtering by tunable filters. The nested arrays of fixed, coarse and adjustable, fine filters
are arranged in a manner similar to a vernier scale.


designed  with  both  these  circuits  and  possible  cochlear  analogues  in  mind. 

In the SCFB architecture, the fixed gammatone filterbank with relatively coarse
bandpass tunings (Q = 4) emulates the behavior of the passive basilar membrane whose stiffness
decreases monotonically from base to apex. The bandwidths of the gammatone filters were
chosen to approximate cochlear impulse responses and tuning characteristics observed for
input signals at high sound pressure levels and are thought to be consequences of largely
passive mechanical filtering19. In the SCFB architecture, finer frequency tuning is achieved
using a second layer of narrower bandpass filters (BPFs, Q = 8) that emulate the filtering
functions of outer hair cells (OHCs). In the cochlea, while inner hair cells (IHCs) are thought
to be relatively passive mechanoelectrical transducers, outer hair cells also have active
electromechanical processes that permit them to change length under the influence of their
transduction currents, thereby amplifying local mechanical vibrations20.
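
To give a feel for the two tuning layers, the following sketch tabulates bandwidths for the coarse (Q = 4) and fine (Q = 8) filters at logarithmically spaced center frequencies. It assumes the usual definition Q = center frequency / bandwidth; the channel count and frequency range are hypothetical placeholders, not values taken from the paper.

```python
def log_spaced_centers(f_low, f_high, k):
    """K center frequencies (Hz), equally spaced on a logarithmic axis."""
    step = (f_high / f_low) ** (1.0 / (k - 1))
    return [f_low * step ** i for i in range(k)]

def bandwidth_hz(f_center, q):
    """Bandwidth implied by Q = f_center / bandwidth (assumed definition)."""
    return f_center / q

# Hypothetical bank: 7 channels, one per octave from 100 Hz to 6.4 kHz
centers = log_spaced_centers(100.0, 6400.0, 7)
coarse = [bandwidth_hz(fc, 4.0) for fc in centers]  # fixed gammatone layer, Q = 4
fine = [bandwidth_hz(fc, 8.0) for fc in centers]    # tunable BPF layer, Q = 8
```

Under this definition, each tunable filter is half as wide as the fixed filter it follows, at every center frequency.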

The proposed adaptive bandpass filter (BPF) triplets that form the heart of the
frequency discriminator loop (FDL) consist of three relatively narrowly tuned filters with
slightly offset center frequencies that are in cascade with each fixed filter of the passive
gammatone filterbank. This arrangement contrasts with the situation in the cochlea, where
OHCs with their active processes and narrower tunings are in bidirectional interaction with
the more broadly tuned motions of the basilar membrane19. The BPF triplets are locally
adaptive and are tuned based on differences in amplitudes of signals output by the filters
in the triplet. Although broadly similar designs were available in the adaptive filtering
literature21,22, independent of auditory modeling, it was the spatial arrangement of outer
hair cells (OHCs) observed in mammalian cochleae23 that inspired this particular triplet
design. The lateral amplitude differencing process in each BPF triplet amounts to taking the
spatial derivative of the local amplitude spectrum at that particular cochlear location. Such
lateral differencing processes could conceivably be carried out over time spans of up to tens
of milliseconds via lateral interactions in intracochlear and olivocochlear neural networks24
(p. 15, Fig. 1.13(A)),25,26 (p. 289, Fig. 11).

The tuned, oscillatory motility of outer hair cells inspired use of a voltage-controlled
oscillator (VCO) to tune the filter triplets. Feedback control of triplet tuning could also
be potentially implemented via other signal processing mechanisms. The action of hair cell
stereocilia that open ion channels preferentially in one direction suggests half-wave
rectification of the signal, an operation similar to envelope detection that is already commonly
used in auditory modeling. The nonlinear response characteristics of hair cells inspired the
logarithmic compression of the envelope (see section II.B) that is used by the frequency
discriminator loop to capture dominant signals and suppress weaker ones. All of these design
features stem from the general idea that many aspects of cochlear function and auditory
nerve behavior can be emulated by frequency tracking circuits.


C.  Organization  of  the  paper 

This paper first describes the operation of components of the adaptive filters, followed
by the architecture of the SCFB as a whole. In section II, FDLs and their use as basic
tone followers are presented. As mentioned earlier, each FDL is made up of three tunable
bandpass filters (called a BPF triplet). Tuning of the triplet filters is effected using voltage
controlled oscillators (VCOs). In section II.A a simple tone follower (STF) consisting of a
BPF triplet and a VCO is described that is capable of tracking the frequency of a tone. The
linear equivalent circuit of the tone follower is presented, which is useful in choosing the
loop filter parameters of the FDL. The dominant tone follower (DTF) is then developed in
section II.B. The DTF uses a simple nonlinearity in the feedback loop of the FDL to lock
on to the dominant tone when the input consists of more than one tone. In other words, the
DTF is capable of synchrony capture. In section II.C a practical implementation of the BPF
triplet is presented that has several desirable characteristics for signal processing purposes,
such as linear phase, perfect even and odd symmetry, and single-VCO operation.

In section III a traditional fixed gammatone filterbank is combined in cascade with a
bank of FDLs to form the synchrony capture filterbank (SCFB). Responses of the filterbank
to harmonic tone complexes, isolated vowels, and running speech are presented in section
IV. Correspondences with cochlear response characteristics and auditory nerve behavior are
discussed in section V. Section V also discusses relationships of the proposed algorithm
to previous research.

II. TONE FOLLOWERS AND FREQUENCY CAPTURE


Frequency discriminator loops (FDLs) have been used for synchronizing transmitter and
receiver oscillators in digital and analog communication systems for decades27,28. Typically,
in a communication receiver, an FDL brings the receiver oscillator frequency close to the
transmitter frequency, i.e., within the lock-in range of a phase lock loop, such that it can
lock the two oscillators29. The structure of the frequency tracking algorithms used here,
called tone followers, is similar to that of the FDLs used in communication systems. The block
diagram of a generic FDL is shown in Figure 4. It consists of a frequency error detector
(FED), a loop filter and a voltage controlled oscillator (VCO). The FED outputs an error
signal e(t) that is proportional to the difference between the frequency of the input signal,
$\omega_1$, and the frequency of the VCO output, $\omega_c$. The loop filter provides the control voltage
to the VCO and drives its frequency such that $\omega_c - \omega_1$ tends to zero. Typically, the system
function F(s) of the loop filter determines the loop dynamics and has the form $k_p + k_i/s$, where
$k_p$ and $k_i$ are the proportional and integral gain factors30, respectively (more details below
in Section II.A).
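
The action of a proportional-plus-integral loop filter can be illustrated with a small discrete-time simulation. This is only a sketch under stated assumptions: the FED is replaced by its idealized linear behavior (error proportional to the frequency difference), the sign is chosen so that positive gains give negative feedback, and all gain values, the step size, and the function name `simulate_fdl` are hypothetical.

```python
def simulate_fdl(f_in, f_start, kp=0.2, ki=200.0, ks=1.0, kv=1.0,
                 dt=1e-4, n_steps=2000):
    """Linearized frequency discriminator loop tracking a fixed input tone.

    f_in    : input tone frequency (Hz)
    f_start : free-running VCO frequency (Hz)
    kp, ki  : proportional and integral gains of F(s) = kp + ki/s (assumed values)
    ks, kv  : discriminator and VCO gains (assumed values)
    Returns the VCO frequency after each step.
    """
    f_vco = f_start
    integ = 0.0
    history = []
    for _ in range(n_steps):
        e = ks * (f_in - f_vco)         # idealized, linear FED output
        integ += ki * e * dt            # integral path, ki/s
        control = kp * e + integ        # loop filter output
        f_vco = f_start + kv * control  # VCO frequency follows control voltage
        history.append(f_vco)
    return history

# VCO starts at 450 Hz and is pulled onto a 500 Hz tone
trace = simulate_fdl(f_in=500.0, f_start=450.0)
```

With these placeholder gains the VCO settles on the input frequency within a few tens of milliseconds of simulated time, and the integral term drives the steady-state frequency error to zero.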

Section II.A describes how an FDL is used as a simple tone follower (STF) and defines
its components. A linear equivalent circuit of the FDL is also provided. In most realistic
sound processing contexts one encounters multiple sinusoidal signals (as in a voiced speech
formant). In section II.B, a dominant tone follower (DTF) is described that is capable of
following a dominant tone in the presence of other interfering weaker tones and exhibits
synchrony capture. This is realized by using a compressive nonlinearity in the feedback
path. The linear equivalent circuit for the DTF is essentially identical to that of the STF.

A.  A  simple  tone  follower  (STF)22 

The  frequency  discriminator  loop  (FDL)  (Figure  4)  tracks  the  frequency  of  an  input  tone 
by  using  a  frequency  error  detector  (FED)  that  steers  the  center  frequencies  of  the  VCOs 
of  the  triplet  adaptive  filters  (Figure  5).  Another  type  of  FED  is  described  in  Appendix  A. 

FIG. 4. A generic frequency discriminator loop (FDL). The error signal e(t) is a measure
of the frequency difference between the input signal and the VCO. See Figures 5 and 8 for
details of specific frequency error detectors.


In principle, the FED consists of three identically shaped tunable bandpass filters (BPFs),
$H_R(\omega)$, $H_C(\omega)$ and $H_L(\omega)$, initially centered around frequencies $\omega_c + \Delta$, $\omega_c$ and
$\omega_c - \Delta$, respectively. The subscripts R, C and L stand for the right, center and left filters,
respectively. As $\omega_c$, the frequency of the VCO (in Figure 4), is changed, the center frequencies
of the BPFs also change accordingly, such that these filters’ response functions slide along the
frequency axis. The spacing between triplet filters ($\Delta$) is fixed. Only the left and right filters
are used in calculating the error signal e(t). The envelope detectors compute the (squared)
envelope of the BPFs’ outputs. When a tone, $A_1 \cos(\omega_1 t + \theta_1)$, is presented to the FED, the
average values of the (squared) envelopes for the right and left filters are
$e_R(t) = |A_1 H_R(\omega_1)|^2$ and $e_L(t) = |A_1 H_L(\omega_1)|^2$, respectively. (If the input tone’s frequency
changes with time then $e_R$ and $e_L$ are also functions of time t.) Then the error signal e(t) is
computed as the ratio of the difference of the envelopes, $e_R(t) - e_L(t)$, to their sum,
$e_R(t) + e_L(t)$.

Note that the ratio eliminates the amplitude of the input signal, $A_1$, from e(t), so that
e(t) depends only on the frequency error $\omega_c - \omega_1$. Instead of computing the ratio, an AGC
circuit at the input could have been used to normalize the amplitude. The principle is to
move the frequency responses of the BPFs $H_R(\omega)$ and $H_L(\omega)$ (and $H_C(\omega)$) in tandem, under
the control of the VCO frequency $\omega_c$, such that when the error e(t) = 0, $\omega_c$ equals $\omega_1$. Thus,
the VCO tracks the input frequency.

FIG. 5. Frequency error detector (FED) used in the simple tone follower (STF). Error
signal e(t) is computed using the formula $e(t) = \frac{e_R(t) - e_L(t)}{e_R(t) + e_L(t)}$. The envelopes
$e_L(t)$, $e_R(t)$, and $e_C(t)$ are obtained as $I^2 + Q^2$. The $I$ and $Q$ for center filter $H_C(\omega)$ are
the outputs of the LPFs shown in (b). $H_L(\omega)$ and $H_R(\omega)$ have the same structure but with
oscillator frequencies at $\omega_c - \Delta$ and $\omega_c + \Delta$ respectively. The discriminator transfer
characteristics $S(\omega)$ (thick line) and magnitude responses of left and right filters (thin lines)
are shown in (c).


The frequency discriminator function

$$S(\omega) = \frac{|H_R(\omega)|^2 - |H_L(\omega)|^2}{|H_R(\omega)|^2 + |H_L(\omega)|^2}$$

(also called the “S-curve”29) is shown in Figure 5c. When a tone $A_1 \cos(\omega_1 t + \theta_1)$ is applied
as the input, then $e(t) = S(\omega_1)$. In the interval $\omega_c - \Delta < \omega < \omega_c + \Delta$ the error voltage e(t)
is approximately linear, so $e(t) \approx k_s(\omega_c - \omega_1)$. $k_s$ is called the frequency discriminator
constant29.
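
A numerical sketch of the S-curve may help make its properties concrete. The Gaussian magnitude shape used below is purely an assumption chosen for convenience (it is not the filter shape used in the paper), and the parameter values are hypothetical. For this shape the right-minus-left ratio reduces to a hyperbolic tangent of the frequency offset, which is zero at the VCO frequency, odd-symmetric about it, and independent of the tone amplitude.

```python
import math

def gauss_mag_sq(f, center, sigma):
    """Squared magnitude response of an assumed Gaussian-shaped bandpass filter."""
    return math.exp(-((f - center) / sigma) ** 2)

def fed_error(f_tone, amp, f_c, delta, sigma):
    """Normalized envelope difference (e_R - e_L) / (e_R + e_L) for a pure tone."""
    e_r = amp ** 2 * gauss_mag_sq(f_tone, f_c + delta, sigma)  # right filter
    e_l = amp ** 2 * gauss_mag_sq(f_tone, f_c - delta, sigma)  # left filter
    return (e_r - e_l) / (e_r + e_l)

# Hypothetical triplet: VCO at 1 kHz, filters offset by 50 Hz, width 100 Hz
F_C, DELTA, SIGMA = 1000.0, 50.0, 100.0
at_center = fed_error(1000.0, 1.0, F_C, DELTA, SIGMA)
above = fed_error(1020.0, 1.0, F_C, DELTA, SIGMA)
below = fed_error(980.0, 1.0, F_C, DELTA, SIGMA)
quiet = fed_error(1020.0, 0.01, F_C, DELTA, SIGMA)
loud = fed_error(1020.0, 100.0, F_C, DELTA, SIGMA)
```

In this sketch the error is positive for tones above the VCO frequency; the sign attached to the discriminator constant depends on how the frequency error is defined.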

The tunable BPFs are built using the filter structure shown in Figure 5b (called the “cos-cos” structure), which shows how $H_C(\omega)$ (centered at $\omega_c$) is realized using two lowpass filters (LPFs). Identical LPFs with frequency response $H(\omega)$ are sandwiched between two multipliers in both the lower and upper branches of the circuit. Both the multipliers in the upper branch are supplied with $\cos\omega_c t$ (hence the name cos-cos structure) and those in the lower branch are supplied with $\sin\omega_c t$ from the same VCO with frequency $\omega_c$. It can be easily shown that

$$H_C(\omega) = H(\omega + \omega_c) + H(\omega - \omega_c). \tag{1}$$

Similarly, the BPF $H_L(\omega)$ (or $H_R(\omega)$) is implemented as a cos-cos structure with the same LPFs but with the VCO frequency at $\omega_c - \Delta$ (or $\omega_c + \Delta$). Together the three filters shown inside the FED box in Figure 5a are called a BPF triplet. The frequency spacing between these filters, $\Delta$, is kept fixed. Only the left and right filters are used in calculating the error signal $e(t)$.
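The claim that the two-branch structure behaves as a tunable bandpass filter can be checked numerically. The discrete-time sketch below assumes that the lower branch uses $\sin\omega_c t$ on both of its multipliers and that the two branch outputs are summed; the FIR lowpass, sampling rate, and VCO frequency are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000.0                       # sampling rate in Hz (assumed)
f_c = 901.0                        # VCO frequency in Hz (assumed)
w_c = 2 * np.pi * f_c / fs         # VCO frequency in rad/sample

# Any lowpass FIR works for this identity; a short Hann-shaped one is assumed.
h = np.hanning(65)
h /= h.sum()

x = rng.standard_normal(2000)      # arbitrary test input
n = np.arange(x.size)
cos_n, sin_n = np.cos(w_c * n), np.sin(w_c * n)


def lowpass(signal):
    """Causal FIR lowpass, output truncated to the input length."""
    return np.convolve(signal, h)[: signal.size]


# Two-branch structure: multiply, lowpass, multiply again, then sum the branches.
y_structure = cos_n * lowpass(x * cos_n) + sin_n * lowpass(x * sin_n)

# Equivalent time-invariant bandpass filter with impulse response h[k] cos(w_c k).
k = np.arange(h.size)
h_bandpass = h * np.cos(w_c * k)
y_bandpass = np.convolve(x, h_bandpass)[: x.size]

print(np.max(np.abs(y_structure - y_bandpass)))
```

The two outputs agree to rounding error because $\cos(\omega_c n)\cos(\omega_c (n-k)) + \sin(\omega_c n)\sin(\omega_c (n-k)) = \cos(\omega_c k)$, so the structure is equivalent to the lowpass response translated to $\pm\omega_c$, up to a constant gain that depends on how the branch outputs are scaled.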

The center filter envelope is used to declare a “track” condition, i.e., that the filter has converged on a tonal input. When this convergence occurs at the input tone frequency $\omega_1$, then the envelope of the center filter output $e_C(t)$ will satisfy the following condition,

$$e_L(t) = e_R(t) = \mu\, e_C(t) \tag{2}$$

for some constant $\mu$. If the filter shapes are chosen such that $|H_R(\omega_c)| = |H_L(\omega_c)| = 0.707\,|H_C(\omega_c)|$ (i.e., the 3-dB points of the right and left filters coincide with the center frequency of the center filter), then $\mu = 0.5$. If the above condition is satisfied, then the input is a tone whose frequency coincides with the VCO frequency $\omega_c$, and a “track” condition is declared. Such channel outputs can be used to compute the pitch frequency of a complex tone. This FED structure requires three VCOs operating at $\omega_c - \Delta$, $\omega_c$ and $\omega_c + \Delta$ to realize $H_L(\omega)$, $H_C(\omega)$, and $H_R(\omega)$ respectively.

FIG. 6. Convergence of a BPF triplet on an input tone at $\omega_1$. (a) Frequency responses of BPF triplet filters in relation to an input tone. The input tone frequency is $\omega_1 = 2\pi \times 950$ Hz. Initially the L, C, and R filters are centered at $\omega_c - \Delta = 2\pi \times 859$ Hz, $\omega_c = 2\pi \times 901$ Hz and $\omega_c + \Delta = 2\pi \times 943$ Hz, respectively. Since initially $\omega_1 > \omega_c$, the initial envelope output $e_R(t)$ is greater than $e_L(t)$, so the normalized error $e(t)$ is positive. This positive value of $e(t)$ causes the VCO frequency $\omega_c$ to increase until $\omega_c$ equals $\omega_1$. (b) Time course of envelopes $e_R(t)$, $e_C(t)$ and $e_L(t)$. Note that the envelopes $e_R(t)$ and $e_L(t)$ become equal after some settling time and that $e_C(t)$ reaches a higher plateau, where $e_L(t) = e_R(t) = 0.5\,e_C(t)$. (c) VCO frequency track for the C filter.

An approximate linear equivalent circuit of the frequency discriminator loop can provide
some insight into the behavior of the tone follower (Figure 7). Here the input tone and the oscillator output are replaced by their frequency values $\omega_1$ and $\omega_c$, respectively. Recall that the frequency error detector (FED) outputs a voltage level proportional to the frequency difference between $\omega_1$ and $\omega_c$. Therefore, the FED in Figure 5a is modeled by a proportionality constant $k_s$. Assuming that we operate the discriminator loop in the region $\omega_c - \Delta < \omega < \omega_c + \Delta$, this constant $k_s$ is the gain factor representing the slope of the S-curve shown in Figure 5c. Assuming that the sandwiched LPF in Figure 5b has a system function $1/(s + \alpha)$, where $\alpha$ represents its 3-dB bandwidth, it can be shown that the frequency error discriminator constant $k_s$ is equal to $2\Delta/(\Delta^2 + \alpha^2)$ (see Appendix B). In addition, note that the calculation of the envelopes needed to estimate the frequency difference entails a group delay $\tau_g$. This time delay is represented by its Laplace transform $e^{-s\tau_g}$ in Figure 7. At low frequencies the BPFs are narrower, and hence $\tau_g$ is relatively large. At high frequencies $\tau_g \approx 0$. In Figure 7, $e^{-s\tau_g}$ is approximated (using the Padé approximation31) by a ratio of first order s-polynomials,

$$e^{-s\tau_g} \approx \frac{1 - \gamma s}{1 + \gamma s}, \tag{3}$$

where $\gamma = \tau_g/2$. The controller is a loop filter whose transfer function is $F(s) = k_p + k_i/s$, where $k_p$ is the proportional constant and $k_i$ is the integral constant (30, page 254).


FIG. 7. Linearized model of the frequency discriminator loop.

Then, the closed loop transfer function $H(s)$ of the linearized model is

$$H(s) = \frac{B(s)}{A(s)} \tag{4}$$

$$\phantom{H(s)} = \frac{\dfrac{1 - \gamma s}{1 + \gamma s}\, k_s \left(k_p + \dfrac{k_i}{s}\right)}{1 + \dfrac{1 - \gamma s}{1 + \gamma s}\, k_s \left(k_p + \dfrac{k_i}{s}\right)}. \tag{5}$$


After some simplification we find that the denominator polynomial $A(s)$, which determines the settling time $\tau_s$ of the loop, is given by the following expression,

$$A(s) = s^2 + \frac{(1 + k_s k_p - \gamma k_s k_i)}{(\gamma - \gamma k_s k_p)}\, s + \frac{k_i k_s}{(\gamma - \gamma k_s k_p)}. \tag{6}$$

Using Routh’s Stability Criterion, the conditions for stability are given by

$$(\gamma - \gamma k_s k_p) > 0 \;\Rightarrow\; k_p < \frac{1}{k_s},$$

$$(1 + k_s k_p - \gamma k_s k_i) > 0 \;\Rightarrow\; \gamma k_i - k_p < \frac{1}{k_s},$$

$$k_i k_s > 0 \;\Rightarrow\; k_i > 0 \quad (k_s \text{ is positive}).$$

We need to find $k_p$ and $k_i$ such that the step response has a desirable settling time. This is done using the standard pole positioning method (30, page 233) based on Bessel polynomials. For a second order system with a normalized settling time of 1 second, the Bessel roots of the closed loop system are at $-4.05 \pm j2.34$. For a desired settling time of $\tau_s$ seconds, the roots are scaled by $\tau_s$, i.e., $(-4.05 \pm j2.34)/\tau_s$. Hence the corresponding Bessel polynomial is $s^2 + (8.11/\tau_s)s + 21.90/\tau_s^2$. By comparing this polynomial with the $A(s)$ in Eq. 6, we can write the following two linear equations in terms of $k_p$ and $k_i$:

$$a_1 k_i + b_1 k_p = c_1,$$
$$a_2 k_i + b_2 k_p = c_2,$$

where

$$a_1 = \tau_s \gamma k_s, \qquad b_1 = -k_s(\tau_s + 8.11\gamma), \qquad c_1 = (\tau_s - 8.11\gamma),$$
$$a_2 = \tau_s^2 k_s, \qquad b_2 = 21.90\,\gamma k_s, \qquad c_2 = 21.90\,\gamma.$$

Solving for $k_p$ and $k_i$ gives

$$k_p = \frac{1}{k_s}\,\frac{\beta - 1}{\beta + 1}, \qquad k_i = \frac{1}{k_s}\left(\frac{21.90\,\gamma}{\tau_s^2}\right)\frac{2}{\beta + 1}, \tag{7}$$

where $\beta = 8.11\left(\dfrac{\gamma}{\tau_s}\right) + 21.90\left(\dfrac{\gamma}{\tau_s}\right)^2$.

An  example  of  the  operation  and  convergence  dynamics  of  a  simple  tone  follower  (STF) 
in  response  to  a  pure  tone  nearby  in  frequency  is  illustrated  in  Figure  6,  and  described  in 
the  caption.  The  step  response  of  the  linear  equivalent  circuit  (step  size  is  950  —  901  =  49 
Hz)  coincides  almost  exactly  with  that  of  the  frequency  track  shown  in  Figure  6c. 
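The pole-placement step can be sketched in a few lines of Python. The function names and the numerical values of `tau_g` and `tau_s` below are hypothetical and serve only to check that the gains of Eq. 7, substituted into the denominator of Eq. 6, reproduce the scaled Bessel roots.

```python
import numpy as np


def loop_gains(k_s, tau_g, tau_s):
    """Return (k_p, k_i) from Eq. 7 for discriminator gain k_s,
    loop delay tau_g and desired settling time tau_s."""
    gamma = tau_g / 2.0
    r = gamma / tau_s
    beta = 8.11 * r + 21.90 * r**2
    k_p = (beta - 1.0) / (beta + 1.0) / k_s
    k_i = (21.90 * gamma / tau_s**2) * 2.0 / (beta + 1.0) / k_s
    return k_p, k_i


def closed_loop_poles(k_s, k_p, k_i, tau_g):
    """Roots of the denominator polynomial A(s) of Eq. 6."""
    gamma = tau_g / 2.0
    lead = gamma - gamma * k_s * k_p
    a1 = (1.0 + k_s * k_p - gamma * k_s * k_i) / lead
    a0 = k_i * k_s / lead
    return np.roots([1.0, a1, a0])


# Hypothetical values, chosen only for illustration.
delta = alpha = 2 * np.pi * 42.0          # rad/s
k_s = 2 * delta / (delta**2 + alpha**2)   # STF discriminator constant
tau_g = 5e-3                              # loop delay in seconds
tau_s = 30e-3                             # desired settling time in seconds

k_p, k_i = loop_gains(k_s, tau_g, tau_s)
poles = closed_loop_poles(k_s, k_p, k_i, tau_g)
print(k_p, k_i, poles * tau_s)   # scaled poles sit near -4.05 +/- j2.34
```

Multiplying the poles by `tau_s` removes the time scaling, so the printed values can be compared directly with the normalized Bessel roots.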


B.  Dominant  tone  follower  (DTF) 


The  simple  tone  follower  (STF)  is  suitable  for  tracking  one  tone,  but  in  real  world 
acoustic  environments,  pure  tonal  signals  are  only  rarely  encountered.  Instead,  the  vast 
majority  of  signals  are  mixtures  of  complex  sounds  from  multiple  sources  that  can  contain 
nearby  partials  or  harmonics.  Here  a  dominant  tone  follower  (DTF)  is  needed  that  can 
track  the  frequency  of  a  dominant  partial  in  a  signal  even  in  the  presence  of  other  interfering 
ones,  similar  to  the  synchrony  capture  behavior  observed  in  the  auditory  nerve.  A  simple 
modification  of  the  STF  described  above  that  employs  a  nonlinearity  in  the  feedback  loop 
results  in  the  dominant  tone  follower  (DTF)  described  below. 

Consider a signal $x(t)$ consisting of a tone at frequency $\omega_1 = 2\pi f_1$ and an interfering tone at $\omega_2 = 2\pi f_2$:

$$x(t) = A_1\cos(\omega_1 t + \theta_1) + A_2\cos(\omega_2 t + \theta_2). \tag{8}$$


Let us assume that $A_1 > A_2$, i.e., the tone at $\omega_1$ is dominant. We rewrite $x(t)$ using complex notation as follows:

$$x(t) = \Re\left\{A_1 e^{j(\omega_1 t + \theta_1)}\left[1 + \frac{A_2}{A_1} e^{j(\Delta\omega t + \Delta\theta)}\right]\right\}, \tag{9}$$

where $\Re$ stands for “Real part of”, $\Delta\omega = \omega_2 - \omega_1$, $\Delta\theta = \theta_2 - \theta_1$, and $j = \sqrt{-1}$. Since $A_2/A_1 < 1$ (using the approximation that $e^y \approx 1 + y$ for $y \ll 1$ in the above expression), we have

$$x(t) \approx a(t)\cos(\phi(t)), \tag{10}$$

where the envelope is

$$a(t) \approx A_1 \exp\left(\frac{A_2}{A_1}\cos(\Delta\omega t + \Delta\theta)\right), \tag{11}$$

and the phase function is

$$\phi(t) \approx \omega_1 t + \theta_1 + \frac{A_2}{A_1}\sin(\Delta\omega t + \Delta\theta). \tag{12}$$

The derivative of $\phi(t)$ (i.e., the instantaneous frequency (IF)18, p. 180) and the log-envelope are as follows:

$$\frac{d\phi(t)}{dt} \approx \omega_1 + \frac{A_2}{A_1}\Delta\omega\cos(\Delta\omega t + \Delta\theta), \tag{13}$$

$$\log a(t) \approx \log A_1 + \frac{A_2}{A_1}\cos(\Delta\omega t + \Delta\theta). \tag{14}$$

The symbol log denotes the natural logarithm. Note that the average value of the IF is $\omega_1$, the dominant tone’s frequency, and similarly, the average value of the log-envelope is the dominant tone’s log amplitude. Either of these properties can be utilized for frequency discrimination purposes. An exact expression for the log-envelope of $x(t)$ can also be obtained as follows:


$$a^2(t) = \left|A_1 e^{j\omega_1 t + j\theta_1} + A_2 e^{j\omega_2 t + j\theta_2}\right|^2 = A_1^2 + A_2^2 + 2A_1A_2\cos(\Delta\omega t + \Delta\theta). \tag{15}$$

Taking the logarithm and using the infinite series expansion for $\log(1 + x)$ we have

$$\log a(t) = \log A_1 + \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}\left(\frac{A_2}{A_1}\right)^n\cos(n\Delta\omega t + n\Delta\theta). \tag{16}$$

Note that Eq. 14 retains only the first term in the infinite sum above. Also note that the average value of $\log a(t)$ is $\log A_1$. On the other hand, the average value of the squared envelope $a^2(t)$ is $(A_1^2 + A_2^2)$.
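These two averages are easy to confirm numerically. The short sketch below uses the exact squared envelope of Eq. 15 with assumed amplitudes, frequencies, and phase difference, and averages over an integer number of beat periods.

```python
import numpy as np

# Assumed two-tone parameters (illustrative only).
A1, A2 = 1.0, 0.5                 # dominant and interfering amplitudes
f1, f2 = 950.0, 1050.0            # tone frequencies in Hz
fs = 16000.0                      # sampling rate in Hz
t = np.arange(int(fs)) / fs       # one second = 100 full beat periods

d_omega = 2 * np.pi * (f2 - f1)
d_theta = 0.3                     # arbitrary phase difference in radians

# Exact squared envelope from Eq. 15.
a_squared = A1**2 + A2**2 + 2 * A1 * A2 * np.cos(d_omega * t + d_theta)

mean_log_envelope = np.mean(0.5 * np.log(a_squared))
mean_squared_envelope = np.mean(a_squared)

print(mean_log_envelope, np.log(A1))             # both close to 0
print(mean_squared_envelope, A1**2 + A2**2)      # both close to 1.25
```

The averaged log-envelope depends only on the dominant amplitude, whereas the averaged squared envelope carries the power of both tones, which is the property the compressive nonlinearity exploits.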

A frequency discriminator can lock on to $\omega_1$ by filtering the instantaneous frequency (IF, assuming that it is available) using a low-pass filter (LPF) with a cut off frequency $\Delta\omega$. Alternatively, the log-envelope can also be used to capture the dominant signal (Figure 8). In an FDL the logarithmically compressed envelope signal, $\log a(t)$, can be low pass filtered (with the same cut off frequency, $\Delta\omega$, as in the case of IF) to obtain $\log A_1$. This can then be used to lock on to the dominant tone in the input.

FIG. 8. Frequency error detector (FED) for the dominant tone follower (DTF). The error signal $e(t)$ is computed using the formula $e(t) = \log\frac{e_R(t)}{e_L(t)}$.

Compared to the simple tone follower, note that the envelopes in the dominant tone follower are now compressed using a logarithmic nonlinearity before they are low pass filtered (by the loop filter). If the input is just one tone ($x(t) = A_1\cos(\omega_1 t + \theta_1)$) then the corresponding smoothed squared envelopes at the outputs of the right ($H_R(\omega)$) and left ($H_L(\omega)$) filters are $A_{1R}^2 = A_1^2|H_R(\omega_1)|^2$ and $A_{1L}^2 = A_1^2|H_L(\omega_1)|^2$ respectively. So, the error signal is $e(t) = 2\log(A_{1R}/A_{1L})$. Note that $e(t)$ is proportional to the frequency difference $\omega_1 - \omega_c$ and does not depend on the amplitude $A_1$ (as in STF).

Now, consider the case of an input $x(t)$ with two tones as in Eq. 8. Then, there are two cases. In the first case, assume that the same tone (either at $\omega_1$ or $\omega_2$) dominates both (right and left) filters’ outputs. Then, clearly the (average) error is $2\log(A_{1R}/A_{1L})$ or $2\log(A_{2R}/A_{2L})$ depending on which tone dominates. Since the loop tends to drive this error to zero, the VCO frequency $\omega_c$ changes such that the left and right filters’ log-amplitudes are equal. Thus $\omega_c$ tends to track the dominant tone. In contrast, if the nonlinearity is absent then the left and the right filters produce (squared, averaged) envelopes equal to $A_{1L}^2 + A_{2L}^2$ and $A_{1R}^2 + A_{2R}^2$, which result in $\omega_c$ settling in between $\omega_1$ and $\omega_2$, i.e., no capture. Thus, the compressive nonlinearity helps steer the VCO to the dominant signal’s frequency.
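The difference between the two averaging rules can be illustrated with an idealized steady-state sketch. It replaces the filters and envelope detectors by their long-term averages, assumes a one-pole lowpass shape for the triplet filters, and uses a crude integrator with a hypothetical gain; none of these choices is taken from the original implementation.

```python
import numpy as np

# Assumed values: two tones and illustrative filter widths, all in Hz.
tones = [(950.0, 1.0), (1050.0, 0.5)]   # (frequency, amplitude)
delta = 42.0                             # triplet spacing
alpha = 42.0                             # one-pole LPF bandwidth (assumed)


def tone_powers(center):
    """Power of each tone at the output of a filter centered at `center`."""
    return [a**2 / ((f - center) ** 2 + alpha**2) for f, a in tones]


def averaged_error(f_vco, compress):
    """Long-term average of the error signal for a given VCO frequency."""
    p_right = tone_powers(f_vco + delta)
    p_left = tone_powers(f_vco - delta)
    if compress:
        # Averaging the log envelope keeps only the stronger tone in each filter.
        return np.log(max(p_right) / max(p_left))
    # Averaging the squared envelope adds the tone powers instead.
    right, left = sum(p_right), sum(p_left)
    return (right - left) / (right + left)


def settle(compress, f_start=901.0, gain=5.0, steps=2000):
    """Crude integrator loop: nudge the VCO frequency along the error."""
    f_vco = f_start
    for _ in range(steps):
        f_vco += gain * averaged_error(f_vco, compress)
    return f_vco


f_log = settle(compress=True)
f_lin = settle(compress=False)
print(f_log, f_lin)
```

With these assumed numbers the compressed loop settles on the dominant 950 Hz tone, while the uncompressed loop comes to rest slightly above it, pulled toward the weaker tone.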

In the second case, if the tone at $\omega_1$ dominates the left filter output and the tone at $\omega_2$ dominates the right filter output, then the error $e(t)$ is proportional to $\log(A_{2R}/A_{1L})$ and the VCO frequency is adjusted by the loop such that $A_{2R} = A_{1L}$. That is, $\omega_c$ averages in between $\omega_1$ and $\omega_2$. In summary, if one tone is sufficiently bigger than the other, then capture occurs, but if two tones are close in frequency and have equal or almost equal amplitudes, then the VCO locks on to a weighted average frequency. This behavior is similar to that seen in the auditory nerve (Figure 2b) for nearby partials.

The linear equivalent circuit for the DTF is essentially identical to that of the STF developed in Section II.A, except that the parameter $k_s$ is slightly different, $k_s = \dfrac{4\Delta}{\Delta^2 + \alpha^2}$ (see Appendix B). Figure 9 shows an example of a DTF homing in on a stronger tone in the presence of a nearby weaker tone (vertical arrows). Such dominant tone followers are used as the building blocks for the proposed filterbank algorithm described below in Section III.


C.  A  practical  implementation  of  the  frequency  discriminator  loop  (FDL) 

FIG. 9. Behavior of a DTF in response to two nearby tones of different amplitude. (a) Frequency response of BPF triplet filters and the input tones (vertical arrows, dominant tone at $\omega_1 = 2\pi \times 950$ Hz, plus a half-amplitude interfering tone at $\omega_2 = 2\pi \times 1050$ Hz). (b) Track of the VCO frequency for the center filter C. With minor fluctuations, the VCO tracks the stronger 950 Hz tone in spite of the weaker 1050 Hz interferer.

This section presents the design of an FDL which incorporates a single VCO and matched BPF triplet filters. This implementation of the BPF triplet (and the FDL) that requires only one VCO has several advantages over those described above. The filters that form the BPF triplet are implemented as linear phase filters. The BPF triplet is implemented with the help of odd/even prototype filters such that they result in perfectly matched, symmetrical, left ($H_L(\omega)$) and right ($H_R(\omega)$) filters. That is, their frequency response magnitudes are exactly equal at the VCO’s frequency $\omega_c$. Further, the computation of the envelopes $e_R(t)$ and $e_L(t)$ does not explicitly require in-phase (I) and quadrature phase (Q) signal components. Instead the envelope is simply obtained by taking the absolute value of the signal, i.e., the full-wave rectified output, and low-pass filtering it. The three bandpass filters that constitute the BPF triplet can all be synthesized from a single prototype noncausal, low-pass impulse response,

$$h(t) = e^{-\alpha|t|}, \tag{17}$$

$$H(\omega) = 2\alpha/(\omega^2 + \alpha^2). \tag{18}$$

Any other even impulse response function with unimodal low pass frequency response characteristics (such as $h(t) = e^{-\beta t^2}$) can also be used as a prototype filter. Let $h_1(t)$ and $h_2(t)$ represent the impulse responses of frequency translated filters, given by

$$h_1(t) = e^{-\alpha|t|}\cos\Delta t, \quad\text{and}\quad h_2(t) = e^{-\alpha|t|}\sin\Delta t, \tag{19}$$

where $\Delta$ is the translation frequency. So,

$$H_1(\omega) = (H(\omega - \Delta) + H(\omega + \Delta))/2,$$
$$H_2(\omega) = j(H(\omega - \Delta) - H(\omega + \Delta))/2, \tag{20}$$

where $j = \sqrt{-1}$. $\Delta$ is chosen equal to $\alpha$, so that $\Delta$ is the 3-dB point of $H(\omega)$. The frequency responses $H_1(\omega)$ and $H_2(\omega)$ are purely real and imaginary, respectively.

$H_1(\omega)$ and $H_2(\omega)$ are embedded as part of the tunable band pass filters $G_1(\omega)$ and $G_2(\omega)$ shown in Figures 10a and 10b, respectively. $G_1(\omega)$ is called a cos-cos filter (same structure as Figure 5b) and $G_2(\omega)$ is named a cos-sin filter.

$$G_1(\omega) = (H_1(\omega - \omega_c) + H_1(\omega + \omega_c))/2,$$
$$G_2(\omega) = j(H_2(\omega - \omega_c) - H_2(\omega + \omega_c))/2. \tag{21}$$

The frequency responses $G_1(\omega)$ and $G_2(\omega)$ are both real and even and are shown in Figure 10c. These frequency responses can be tuned by changing $\omega_c$.

Assume for the moment that the systems $H_1(\omega)$ and $H_2(\omega)$ sandwiched between the multipliers are identical. Then, note that the system functions of a generic cos-cos structure, $G_1(\omega)$, and cos-sin structure, $G_2(\omega)$, are related by the expression $G_2(\omega) = j\,\mathrm{sgn}(\omega)\,G_1(\omega)$ for sufficiently large $\omega_c$. That is, the cos-sin structure has an additional term which signifies a Hilbert transform when compared to the cos-cos structure. This stems from the fact that the multipliers in the upper/lower branches of Figure 10b are cosine and sine, unlike the cos-cos filter in Figure 10a. This is a seemingly new way of realizing a band-pass Hilbert transformer. The outputs of the cos-cos and cos-sin filters are then added/subtracted (see Figure 11) to obtain the overall right/left filter responses $H_R(\omega)$ and $H_L(\omega)$ (Figure 10d), respectively. That is,

$$H_R(\omega) = G_1(\omega) - G_2(\omega), \quad\text{and}\quad H_L(\omega) = G_1(\omega) + G_2(\omega). \tag{22}$$

Substituting for $G_1(\omega)$ and $G_2(\omega)$ in Eq. 22 from Eq. 21, we have,

$$H_R(\omega) = (H_1(\omega - \omega_c) + H_1(\omega + \omega_c))/2 - j(H_2(\omega - \omega_c) - H_2(\omega + \omega_c))/2,$$
$$H_L(\omega) = (H_1(\omega - \omega_c) + H_1(\omega + \omega_c))/2 + j(H_2(\omega - \omega_c) - H_2(\omega + \omega_c))/2. \tag{23}$$

FIG. 10. (a) Tunable cos-cos filter. (b) cos-sin filter. (c) Frequency responses $G_1(\omega)$ and $G_2(\omega)$ (without the scale factor $j$) are shown. (d) Frequency responses of the right and left filters, $H_R(\omega)$ and $H_L(\omega)$, obtained as sum and difference of $G_1(\omega)$ and $G_2(\omega)$ (Figure 11). The filters $H_R(\omega)$ and $H_L(\omega)$ are basically synthesized from a single prototype $H(\omega)$, and hence are perfectly matched and symmetric about $\omega_c$. The frequency response of $H_C(\omega)$, not shown, is centered around $\omega_c$. All filters are linear phase filters.

Further substituting for $H_1(\omega)$ and $H_2(\omega)$ in Eq. 23 from Eq. 20 and simplifying, we have

$$H_R(\omega) = (H(\omega - \omega_c - \Delta) + H(\omega + \omega_c + \Delta))/2,$$
$$H_L(\omega) = (H(\omega - \omega_c + \Delta) + H(\omega + \omega_c - \Delta))/2. \tag{24}$$

Thus, the filters $H_R(\omega)$ and $H_L(\omega)$ (shown in Figure 10d) are the original prototype filter $H(\omega)$ shifted to center frequencies $\omega_c + \Delta$ and $\omega_c - \Delta$, respectively. They have purely real valued frequency responses (except for the linear phase introduced by requiring a causal impulse response) and are the ones used in frequency error detection. In practice, the filter impulse responses in Eq. 19 are symmetrically truncated and Hann windowed about the time origin and made causal by shifting them to the right, resulting in linear phase filters. The center filter $H_C(\omega)$ (also tunable), centered around $\omega_c$ (shown in Figure 5b), is synthesized using the cos-cos structure, but with the prototype filter $H(\omega)$ sandwiched between the multipliers. Its output is not used in error signal calculation but is the channel output. If the input tone frequency $\omega_1$ is less than the VCO frequency $\omega_c$ then the envelope at the output of $H_L(\omega)$ is larger than the envelope at the output of $H_R(\omega)$ and the error signal will drive the VCO to make $\omega_c$ equal to $\omega_1$, and vice versa. The loop filter $F(s)$ determines the dynamics. The linear equivalent circuit described in Section II.A is applicable to this implementation as well. The envelope detector shown in Figure 11 is a rectifier in cascade with a LPF. The logarithmic nonlinearity serves the same purpose as in the DTF. The LPF in the envelope detector increases the time delay $\tau_g$ around the loop and has to be included while calculating the loop filter constants $k_p$ and $k_i$.

FIG. 11. Implementation of the frequency error detector and the frequency discriminator loop. The center filter $H_C(\omega)$ (not shown) is implemented using a cos-cos filter structure with $H(\omega)$ sandwiched between the multipliers as in Figure 5b.
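The construction of matched left and right filters from one prototype can be sketched in the time domain. The Python example below works with the equivalent impulse responses rather than with the branch wiring of Figures 10 and 11, which is not reproduced here; the sign convention of the cos-sin term, the truncation length, and the use of the 1980 Hz channel with a 466 Hz gammatone bandwidth (borrowed from the example in Section III.A) are assumptions made for illustration.

```python
import numpy as np

fs = 16000.0                        # sampling rate in Hz
f_c = 1980.0                        # VCO / center frequency in Hz
delta_hz = 466.0 / 4.0              # spacing = one quarter of the gammatone bandwidth
alpha = 2 * np.pi * delta_hz        # prototype decay rate, chosen equal to the spacing
delta = 2 * np.pi * delta_hz
w_c = 2 * np.pi * f_c

half_len = 160                      # +/- 10 ms of support (assumed truncation)
t = np.arange(-half_len, half_len + 1) / fs
window = np.hanning(t.size)

# Even and odd prototypes of Eq. 19, truncated and Hann windowed.
h1 = np.exp(-alpha * np.abs(t)) * np.cos(delta * t) * window
h2 = np.exp(-alpha * np.abs(t)) * np.sin(delta * t) * window

# Time-domain view of the cos-cos and cos-sin branches, then difference and sum.
g1 = h1 * np.cos(w_c * t)
g2 = h2 * np.sin(w_c * t)
h_right = g1 - g2                   # equals prototype * cos((w_c + delta) t)
h_left = g1 + g2                    # equals prototype * cos((w_c - delta) t)

n_fft = 1 << 16
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
mag_right = np.abs(np.fft.rfft(h_right, n_fft))
mag_left = np.abs(np.fft.rfft(h_left, n_fft))

peak_right = freqs[np.argmax(mag_right)]
peak_left = freqs[np.argmax(mag_left)]
at_center = np.argmin(np.abs(freqs - f_c))
print(peak_left, peak_right, mag_left[at_center] / mag_right[at_center])
```

Under these assumptions the two magnitude responses peak at roughly $f_c \mp \Delta$ and are nearly equal at $f_c$, so a tone exactly at the VCO frequency produces a balanced pair of envelopes. Shifting the impulse responses to make them causal would add linear phase without changing the magnitudes.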

III.  SYNCHRONY  CAPTURE  FILTERBANK  (SCFB) 

The proposed synchrony capture filterbank (SCFB) shown in Figure 3a consists of a bank of fixed filters each cascaded with a frequency discriminator loop (FDL). The filterbank consists of $K$ logarithmically spaced gammatone filters that have been widely used in auditory system modeling32. Using physiologically-appropriate filter parameters (approximately constant, low Q filters), gammatone filterbanks effectively replicate the broadly tuned mechanical filtering characteristics of the basilar membrane in the cochlea.

The gammatone filters used here were designed using the Auditory Toolbox developed by Malcolm Slaney32, and further details of the cochlear model implementation are discussed in Ref. 33. In our implementation $K$ is 200. The constant-Q gammatone filters use a mix of “Glasberg and Moore” and “Lyon” parameters spanning center frequencies from 100 to 3940 Hz, with corresponding 3-dB bandwidths ranging from 50 Hz to 905 Hz. Filter Q values (EarQ parameter) are all 4, and the order parameter is 1 (Ref. 33). The minBW used in computing the equivalent rectangular bandwidth (ERB) is 50 Hz. The sampling frequency is 16000 Hz. An example of the frequency responses of one of the fixed filters and the associated three tunable filters of the SCFB is shown in Figure 12. Whereas the broadly tuned, fixed gammatone filters coarsely isolate the various frequency components in the incoming signal, the tunings of the more narrowly tuned bandpass triplet filters in the frequency discriminator loops (FDLs) converge on the precise frequencies of the individual frequency components.

A.  Bandpass  filter  triplet  parameters 

As mentioned earlier, each triplet of tunable filters consists of left, center, and right filters, $H_L(\omega)$, $H_C(\omega)$ and $H_R(\omega)$, whose center frequencies are spaced by a constant ratio. All of them are derived from a single prototype filter $H(\omega)$ defined in Eq. 18, whose frequency response is

$$H(\omega) = \frac{2\alpha}{\alpha^2 + \omega^2}.$$

The parameter $\alpha$ is chosen to be equal to the spacing between the filters, i.e., $\alpha = \Delta$. $\Delta$ has been chosen to be one-fourth of the bandwidth (actually halfwidth) of the gammatone filter. Hence $\alpha = \Delta = B_{GT}/4$ determines the prototype filter, where $B_{GT}$ stands for the gammatone filter bandwidth. For example, Figure 12 shows a gammatone filter centered around 1980 Hz with a bandwidth of 466 Hz. The individual left, center and right triplet filters have center frequencies of 1864, 1980, and 2098 Hz, and have bandwidths and center frequency spacings of 115 Hz. Bandwidths and spacings of fixed gammatone and adaptive triplet filters are proportional to center frequency.

FIG. 12. Location of the BPF triplet filters $H_L$, $H_C$, $H_R$. A typical BPF triplet centered at 1980 Hz. The broader frequency response corresponds to the gammatone filter centered around 1980 Hz.


B.  Frequency  discriminator  loop  filter  design  F(s) 

The typical loop filter used in our implementation is of the form $F(s) = k_p + k_i/s$. The proportional gain $k_p$ is intended to improve the rise time of the step response. The VCOs that steer the tuning of the triplet filters are initially set to match the center frequency $\omega_c$ of their corresponding gammatone filter. Because the loop is initialized with the VCO frequency close to the input signal frequency, a consequence of the frequency selectivity of the associated gammatone filter, choosing $k_p = 0$ does not affect the loop’s rise time performance significantly and also simplifies its implementation. On the other hand, $k_i$ is needed to keep track of the frequency changes in the input and drive the steady state error to zero. The value of $k_i$ depends on the frequency discriminator constant, $k_s$, and also on the parameter $\tau_g$ that represents the group delay of the prototype filter (i.e., its causal approximation) plus any delay introduced (in smoothing the envelope) in the envelope detector in Figure 11. For each channel, the following values were used for the loop filter parameters, and they seem to work well in most circumstances (set $\beta = 1$ in Eq. 7):

$$k_p = 0, \qquad k_i = \frac{1}{k_s}\left(\frac{21.90\,\gamma}{\tau_s^2}\right) = \frac{10.95\,\tau_g}{k_s\,\tau_s^2}.$$

$\tau_s$, the settling time, is chosen to be approximately […]/$f_c$, where $f_c$ is the center frequency of a gammatone filter. The FDL operation is not very sensitive to the choice of these parameters.
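For the simplified choice $k_p = 0$, the general results of Section II.A can be specialized as a quick consistency check. The following is a sketch derived from Eqs. 6 and 7 under that assumption, not a statement taken from the original text.

```latex
\begin{aligned}
A(s)\big|_{k_p = 0} &= s^2 + \frac{1 - \gamma k_s k_i}{\gamma}\, s + \frac{k_i k_s}{\gamma},
  \qquad \gamma = \tau_g / 2, \\
k_i k_s = \frac{21.90\,\gamma}{\tau_s^2}
  \;&\Longrightarrow\;
  A(s) = s^2 + \left(\frac{1}{\gamma} - \frac{21.90\,\gamma}{\tau_s^2}\right) s + \frac{21.90}{\tau_s^2}, \\
1 - \gamma k_s k_i > 0
  \;&\Longleftrightarrow\;
  \tau_s > \sqrt{21.90}\,\gamma \approx 4.68\,\gamma = 2.34\,\tau_g .
\end{aligned}
```

Read this way, the constant term still matches the Bessel target $21.90/\tau_s^2$, the damping term equals $8.11/\tau_s$ only when $\beta = 1$ holds exactly, and the linearized loop remains stable as long as the settling time exceeds roughly 2.3 times the loop delay.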


IV.  SIMULATION  RESULTS 

The SCFB algorithm has been tested with appropriate parameter choices using several synthetic signals and speech signals drawn from the TIMIT database. Here simulation results are presented for one set of synthetic musical notes, an isolated utterance drawn from the ISOLET database, and a set of sentences of continuous speech from the TIMIT database with and without additive noise. For speech signals, the input signal is first subjected to spectral equalization by using a pre-emphasis filter and then processed through the filterbank and the self tuning FDL circuits. The frequencies of the VCOs in FDL modules indicate the frequency components that those modules are tracking and they are plotted as a function of time. The outputs of the BPF triplets are available for further processing, and these can be used to classify whether the signal in a local frequency band is tonal or noise-like. For example, if the envelopes of the three filter outputs are larger than the background noise level and if the center filter has a significantly larger output when compared with the associated left and right filters, then this implies that the corresponding channel has a tonal signal. Conversely, if the three envelopes are approximately equal in size then this implies that the channel output is non-tonal or locally white.
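The classification rule described above can be written as a small decision function. The sketch below is only one possible reading of that rule; the function name and the thresholds `tonal_ratio` and `flat_tolerance` are hypothetical and are not values reported in the paper.

```python
def classify_channel(env_left, env_center, env_right, noise_floor,
                     tonal_ratio=1.5, flat_tolerance=0.25):
    """Label one channel from its three (smoothed) triplet envelopes.

    tonal_ratio and flat_tolerance are hypothetical thresholds, not values
    taken from the paper.
    """
    envelopes = (env_left, env_center, env_right)
    if min(envelopes) <= noise_floor:
        return "below noise floor"
    sides = max(env_left, env_right)
    if env_center >= tonal_ratio * sides:
        return "tonal"
    if max(envelopes) <= (1.0 + flat_tolerance) * min(envelopes):
        return "noise-like"
    return "undecided"


print(classify_channel(0.5, 1.0, 0.5, noise_floor=0.01))    # tonal
print(classify_channel(1.0, 1.05, 0.95, noise_floor=0.01))  # noise-like
```

The first call mimics the track condition of Eq. 2 with $\mu = 0.5$, where the center envelope is twice the side envelopes; the second mimics a locally flat spectrum.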


A.  Dyads  of  synthetic  harmonic  signals 

The filterbank response to synthetic harmonic signals is considered first. The stimulus consists of two note dyads, each made of two harmonic complexes (equal amplitude harmonics, 1 to 6). In musical terms, these are two notes separated by a minor second (16:15) and a perfect fourth (4:3). They are the same signals that produced the auditory nerve interspike interval patterns depicted in Figure 2. The first dyad has two fundamentals (440 and 469 Hz) separated by 6.6%. The second has a frequency separation of 33.3% (with fundamental frequencies 440 and 587 Hz). Perceptually, for the minor second, human listeners hear only one pitch intermediate in frequency between the two notes, whereas for the perfect fourth, two note pitches can be heard.
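For readers who want to reproduce comparable stimuli, the sketch below synthesizes the two dyads and lists the partials that fall below 1000 Hz. The zero starting phases, the 0.5 s duration, and the helper names are assumptions; the original stimulus generation details are not given here.

```python
import numpy as np


def harmonic_complex(f0, t, n_harmonics=6):
    """Sum of equal-amplitude harmonics 1..n_harmonics of f0 (zero phase assumed)."""
    return sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harmonics + 1))


def partials_below(f0s, limit_hz, n_harmonics=6):
    """Sorted list of partial frequencies of all notes that fall below limit_hz."""
    partials = [k * f0 for f0 in f0s for k in range(1, n_harmonics + 1)]
    return sorted(p for p in partials if p < limit_hz)


fs = 16000.0
t = np.arange(int(0.5 * fs)) / fs            # 0.5 s of signal (assumed duration)

minor_second = harmonic_complex(440.0, t) + harmonic_complex(469.0, t)
perfect_fourth = harmonic_complex(440.0, t) + harmonic_complex(587.0, t)

print(partials_below([440.0, 469.0], 1000.0))   # [440.0, 469.0, 880.0, 938.0]
print(partials_below([440.0, 587.0], 1000.0))   # [440.0, 587.0, 880.0]
```

The minor-second dyad places pairs of partials only 29 Hz and 58 Hz apart in the low-frequency region, whereas the partials of the perfect-fourth dyad are separated by well over 100 Hz.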

Responses of the SCFB to these pairs of complex harmonic tones are shown in Figure 13. A “capturegram” plot of the resulting frequency tracks of the VCOs as a function of time shows the locking of groups of channels onto individual frequency components. The plots show only tracks of VCO frequencies of low frequency channels ($f_c < 1000$ Hz) to permit more direct comparison with the interspike interval histograms in Figure 2. Note that most of the VCO frequency tracks with CFs close to the dominant tone frequencies converge rapidly (within a few tens of milliseconds) to their steady state value.

The filterbank response for the closely spaced note dyad (separated by 6.6%) is shown in Figure 13a. This signal has 4 frequency components below 1000 Hz: 440, 469, 880, and 938 Hz. Here the filterbank does not resolve the pairs of nearby partials (440/469 and 880/938 Hz), but rather all the channels converge on the mean frequencies of the nearby partials (channels 53 to 88 fluctuate around 458 Hz, channels 89 to 112 fluctuate around 909 Hz). The pattern of frequency capture is similar to that in the interspike interval data in Figure 2a. Figure 13b shows rectified outputs of each channel’s center filter and Figure 13c shows the autocorrelation of the rectified outputs (from time t = 0.25 to 0.5 seconds). In this case we can see that the fluctuations in envelope are related to the beat frequency (469 - 440 = 29 Hz), as seen in Figure 2a.

The filterbank response to the well-separated note dyad is shown in Figure 13d. This
signal has 3 frequency components below 1000 Hz: 440, 587, and 880 Hz. Clearly each VCO
is captured by the dominant partial in that channel’s neighborhood. Channels with center
frequencies between 300 and 525 Hz lock to 440 Hz, those with center frequencies between 525
Hz and 725 Hz lock to 587 Hz, and the rest are captured by the 880 Hz partial. The transition
of VCO frequency from one dominant tone to the other is abrupt. For example,
for center frequencies near 500 Hz, the channels are captured either by the 440 Hz tone or by
the 587 Hz tone. Very similar behavior is also observed in the interspike interval histograms in
Figure 2b, where interspike intervals in the corresponding CF channels switch abruptly from
interval patterns associated with 440 Hz to those associated with 587 Hz. Figure 13e shows
rectified outputs of each channel’s center filter and Figure 13f shows the autocorrelation
of the rectified outputs, computed after the frequency estimates have become almost constant
(in other words, after the channels’ VCOs have locked; in this case from t = 0.25 to 0.5 seconds).

FIG. 13. Filterbank responses to pairs of harmonic tones. Left. Responses to a note dyad
separated by a minor second ($\Delta F_0 = 6.6\%$, $F_0$s = 440 & 469 Hz). Right. Responses to a note
dyad separated by a perfect fourth ($\Delta F_0 = 33.3\%$, $F_0$s = 440 & 587 Hz). Top plots (a), (d).
Frequency tracks of the VCOs (capturegram). Middle plots (b), (e). Half-wave rectified
output waveforms of channel center filters (analogous to a post-stimulus time neurogram).
Bottom plots (c), (f). Channel autocorrelations (compare with autocorrelation neurograms
of Figure 2).

B. Speech signals


For synthetic signals, such as the musical notes in the previous subsection, the
instantaneous frequency estimates obtained from the VCOs of nearby channels are essentially the
same  after  the  initial  settling  time.  However,  for  natural  signals  like  speech  the  frequency 
estimates  of  the  partials  tend  to  have  some  variability  (as  can  be  seen  below).  Clearly,  some 
sort  of  clustering  method  is  needed  to  obtain  the  average  frequency  tracks  associated  with 
each  frequency  component  in  the  signal.  Other  well  known  auditory-inspired  models  such  as 
the  ZCPA  (Zero-Crossing  Peak  Amplitude)34  or  EIH  (Ensemble  Interval  Histogram)12  use 
the  upward-going  zero  or  level  crossing  events  in  a  signal  (emanating  from  a  filter  channel) 
to estimate the frequency. The reciprocal of the time interval between adjacent zero/level
crossing  events  is  used  as  the  instantaneous  frequency  estimate.  Such  frequency  estimates 
obtained  over  a  time  window  are  collected  to  assemble  a  frequency  histogram.  The  frequency 
histograms  across  all  filter  channels  are  combined  (in  both  ZCPA  and  EIH)  to  represent  the 
output  of  the  auditory  model34.  Further,  in  ZCPA  the  peak  of  the  envelope  that  lies  in 
between  two  consecutive  zero-crossing  events  is  used  as  a  nonlinear  weighting  factor  to  a 
frequency  bin  to  simulate  the  firing  rate  of  the  auditory  nerve.  In  our  case  we  follow  a  similar 
procedure  except  the  frequency  estimates  are  not  derived  from  the  zero-crossing  events  but 
from the VCO frequencies. The envelopes are obtained from the rectified and smoothed
outputs  of  the  center  filter  of  each  channel. 
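
The zero-crossing frequency estimate used by ZCPA and EIH, as described above, can be sketched as follows. This is a minimal illustration and not the published ZCPA or EIH code: the function name `upward_zero_crossing_freqs`, the linear interpolation of crossing instants, and the 440 Hz test tone are all assumptions made for the example.

```python
import math

def upward_zero_crossing_freqs(x, fs):
    """Frequency estimates (Hz) from intervals between upward zero crossings.

    x  : list of samples from one filter channel
    fs : sampling rate in Hz
    """
    crossings = []
    for n in range(1, len(x)):
        if x[n - 1] < 0.0 <= x[n]:
            # Linear interpolation of the crossing instant (assumption of this sketch).
            frac = -x[n - 1] / (x[n] - x[n - 1])
            crossings.append((n - 1 + frac) / fs)
    # Reciprocal of each interval between adjacent upward crossings.
    return [1.0 / (t2 - t1) for t1, t2 in zip(crossings, crossings[1:])]

fs = 16000.0
tone = [math.sin(2.0 * math.pi * 440.0 * n / fs + 0.3) for n in range(1600)]
estimates = upward_zero_crossing_freqs(tone, fs)
print(sum(estimates) / len(estimates))  # average estimate, close to 440 Hz
```

Because each estimate depends only on the current waveform, this is an open loop measurement; the SCFB instead reads the frequency from the VCO inside a feedback loop.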


The frequency values corresponding to the 200 channels are binned into 40 logarithmically
spaced frequency bins that lie between 100 and 4000 Hz. Before binning, each frequency value
is weighted by a nonlinear factor, $\log(1 + a)$, where $a$ is the amplitude/envelope
corresponding to that frequency value, as in ZCPA. Histogram peaks whose heights fall below
a threshold (10% of the peak amplitude) are then eliminated. This removes the silence regions,
where the amplitudes are very low. Only when the log-envelope value is above the threshold
is the frequency estimate for a bin calculated, using

$$\hat{f} = \frac{\sum_n \log(1 + a_n)\, f_n}{\sum_n \log(1 + a_n)},$$

where $a_n$ and $f_n$ represent the amplitude/envelope and frequency values that fall within
a bin. The steps involved in the processing of speech signals are sketched in Figure 14a.
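
The binning and thresholding steps for a single time frame can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation. Three things are assumptions: the function name `capture_histogram`, treating the threshold as 10% of the tallest bin in the frame, and taking the per-bin frequency estimate to be a mean weighted by log(1 + a). The input values are made up for the example.

```python
import math

def capture_histogram(freqs, envs, n_bins=40, f_lo=100.0, f_hi=4000.0, rel_thresh=0.1):
    """One time frame: return {bin index: weighted mean frequency} for surviving bins."""
    log_span = math.log(f_hi / f_lo)
    weight_sum = [0.0] * n_bins   # histogram heights: sum of log(1 + a)
    wf_sum = [0.0] * n_bins       # sum of log(1 + a) * f
    for f, a in zip(freqs, envs):
        if not (f_lo <= f < f_hi):
            continue              # ignore VCO values outside the analysis range
        k = min(int(n_bins * math.log(f / f_lo) / log_span), n_bins - 1)
        w = math.log(1.0 + a)     # compressive weighting, as in ZCPA
        weight_sum[k] += w
        wf_sum[k] += w * f
    peak = max(weight_sum)
    if peak <= 0.0:
        return {}                 # silent frame: nothing survives
    return {k: wf_sum[k] / weight_sum[k]
            for k in range(n_bins) if weight_sum[k] >= rel_thresh * peak}

# Six channels captured near 220 Hz plus one weak, wandering channel at 1500 Hz.
vco_freqs = [219.0, 220.0, 221.0, 220.5, 219.5, 220.0, 1500.0]
envelopes = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01]
tracks = capture_histogram(vco_freqs, envelopes)
print(tracks)
```

In this example the six agreeing channels survive as a single track, while the isolated low-envelope channel falls below the threshold and is removed.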

A  histogram  of  the  distribution  of  frequencies  tracked  by  the  VCOs  is  useful  for  assessing 
the  degree  to  which  channels  have  converged  on  particular  frequencies.  Here  the  number 
of  channels  converging  on  a  particular  frequency  provides  a  robust,  qualitative  measure  of 
its  relative  intensity.  The  running  histogram  of  frequencies  tracked  (Figure  14a)  provides 
a  cleaner  analysis  of  the  time  courses  of  dominant  signal  periodicities.  Thresholding  the 
running capture histogram keeps regions where multiple channels have converged on the
same frequency and removes those where there is little agreement. Figures 14b-d,
15, and 16 demonstrate the character of this analysis.


C.  Isolated  spoken  letters 

The SCFB algorithm was applied to a vowel /i/ (as in “beet”) (file name: fskes0-E1-t.adc,
male speaker) drawn from the ISOLET database. Figure 14(b,c,d) shows the simulation
results. Figure 14b shows the spectrogram of the vowel utterance and Figure 14c shows the
capturegram, i.e., the raw frequency tracks of the 200 VCOs.

It can be seen that the FDLs closely track the frequencies of the individual partials up
to at least 1000 Hz. Depending on the relative intensity of each partial, typically five to ten
channels tend to converge on the frequency tracks of the stronger partials. The first formant
$F_1$ is located at around 300 Hz, between the second and third harmonics. At higher frequencies
(> 2000 Hz), where the filters (the gammatone filters and BPFs) tend to be wider, several
channels tend to converge on the three higher formant frequencies, which are located at
approximately 2400, 2800 and 3800 Hz. Between the first and the second formant frequencies,
where the signal energy is relatively low, there are no dominant tones and hence the VCO
tracks tend to wander. Figure 14d shows the cleaned-up tracks after the histogramming
procedure outlined in Figure 14a is applied. This procedure tends to suppress meandering
tracks and signal components with small envelope values.

FIG. 14. (a) Steps involved in the SCFB algorithm. The input speech signal $s(t)$ (after
preemphasis) is processed by the 200 gammatone filters and the associated FDLs, and the
frequency tracks are plotted as capturegrams. The VCO frequency values and the associated
envelopes are used to generate the frequency histograms from which dominant frequency
tracks are derived. Results for ISOLET vowel /i/: (b) Spectrogram. (c) Capturegram. (d)
Thresholded histogram plot.


D.  Continuous  speech 

The SCFB algorithm was also applied to several continuous speech samples drawn from
the TIMIT database. The speech signals were first pre-emphasized with the filter
$H(z) = 1 - 0.95 z^{-1}$ to equalize the spectrum and prevent strong low-frequency components
from swamping the weaker high-frequency components. The sampling frequency is 16 kHz.
Capturegrams for two speech sentences, “Where were you while we were away?” (TIMIT sx9) and
“The oasis was a mirage” (TIMIT sx280), each spoken by a male and a female speaker, are shown
in Figures 15 and 16, respectively.
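
The pre-emphasis step corresponds to the difference equation y[n] = x[n] - 0.95 x[n-1]. The sketch below applies it and evaluates its gain at a few frequencies for the 16 kHz sampling rate. The function names and the zero initial condition are assumptions of this example rather than details given in the text.

```python
import cmath
import math

def preemphasize(x, coeff=0.95):
    """Apply y[n] = x[n] - coeff * x[n-1], i.e. H(z) = 1 - coeff * z^-1.

    The sample before x[0] is taken to be zero (assumption of this sketch).
    """
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - coeff * prev)
        prev = sample
    return y

def preemphasis_gain(freq_hz, fs_hz, coeff=0.95):
    """Magnitude response of the pre-emphasis filter at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    return abs(1.0 - coeff * cmath.exp(-1j * w))

fs = 16000.0
for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(20.0 * math.log10(preemphasis_gain(f, fs)), 1), "dB")
```

The gain rises monotonically from 0.05 at 0 Hz to 1.95 at the Nyquist frequency, which is what keeps strong low-frequency harmonics from dominating the weaker high-frequency components.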

Figures  15a  and  15d  show  the  spectrograms  of  the  TIMIT  sx9  utterances  by  male  and 
female speakers. In Figures 15b and 15e the corresponding capturegram tracks for the 200
VCOs  are  superimposed  on  the  spectrogram  for  the  male  and  female  utterances.  Typically, 
for  a  strong  low-frequency  harmonic  component,  a  handful  of  channels  are  captured  by  one 
harmonic.  Note  that  at  low  frequencies  and  harmonic  numbers  (f  <  800  Hz,  n  <  8)  almost 
all  the  individual  harmonics  tend  to  be  closely  tracked  by  the  FDLs.  These  frequency  tracks 
together  can  provide  a  robust  representation  of  the  fundamental  frequency  (voice  pitch). 
For  higher  frequencies  and  harmonic  numbers,  only  dominant  harmonics  in  formant  regions 
are  tracked.  This  behavior  is  due  to  the  constant  Qs  of  the  filters,  such  that  FDL  triplet 
filters  with  higher  center  frequencies  have  correspondingly  larger  bandwidths,  and  therefore 
cannot  resolve  individual  harmonics.  Instead  these  filters  lock  onto  the  nearest  dominant 
harmonic  component  somewhere  near  the  middle  of  a  formant. 

Similarly, Figures 16b and 16e show the capturegrams for the sentence TIMIT sx280 spoken
by a male and a female, respectively. In both cases, the frequency transitions, especially
in the higher frequency regions, are precisely and robustly tracked. At lower frequencies, as
one harmonic becomes weaker with respect to a nearby harmonic, the frequency tracks of
channels in that neighborhood jump from the weaker harmonic to the stronger one, due to
the tendency of the FDL to track the stronger component (as in the time-frequency region
t = 1.0 to 1.45 s, frequency < 1000 Hz, in Figure 16e). Again, the last rows of both figures
show the tracks after the histogramming procedure is used to clean up the raw track data.

Previous analysis of cat auditory nerve responses had suggested that the synchrony
capture effect is resistant to noise35. We therefore tested the SCFB algorithm with noisy
speech signals to determine its robustness to noise. Signal power $P_s$ is calculated as the
sum of squares of all the speech signal samples divided by the time duration of the speech
signal. The noise variance $\sigma^2$ is obtained from the definition of signal to noise ratio
(SNR) given below.

$$\mathrm{SNR} = 10 \log_{10} \frac{P_s}{\sigma^2} \ \mathrm{dB}. \qquad (26)$$

The Gaussian distributed noise samples are generated with a variance $\sigma^2$ obtained from
Eq. 26 for an SNR of 10 dB. The generated noise samples are added to the
speech signals, and the sums are processed by the SCFB algorithm. Figure 17 shows the
simulation results. The left column corresponds to “The oasis was a mirage” (sx280) for a
female speaker, and the right column to “Where were you while we were away?” (sx9) for a male
speaker. The spectrograms (a) and (d) are relatively darker than the spectrograms in Figures 15
and 16 because of the added noise at 10 dB SNR. Even in these noise-corrupted cases, the formant
and harmonic tracks (especially the formant transitions) are clearly visible. The capturegrams
show that multiple channels still converge on the same frequencies, and the histogram tracks
are also relatively clean. Thus the behavior of the SCFB in noise seems to parallel that seen
in the cat auditory nerve.
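
The noise-generation step can be sketched as follows. Two points are assumptions of this example: the signal power is taken as the mean-square value per sample (that is, the duration is measured in samples, so that the power is directly comparable to a per-sample noise variance), and a unit-amplitude 440 Hz tone stands in for the speech signal. The helper name `add_gaussian_noise` is hypothetical.

```python
import math
import random

def add_gaussian_noise(speech, snr_db, rng):
    """Return (noisy_signal, sigma2) for white Gaussian noise at the requested SNR."""
    # Mean-square value per sample used as the signal power P_s (assumption).
    p_s = sum(s * s for s in speech) / len(speech)
    sigma2 = p_s / (10.0 ** (snr_db / 10.0))   # from SNR = 10 log10(P_s / sigma^2)
    sigma = math.sqrt(sigma2)
    noisy = [s + rng.gauss(0.0, sigma) for s in speech]
    return noisy, sigma2

fs = 16000.0
rng = random.Random(0)                          # fixed seed for repeatability
clean = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(20000)]
noisy, sigma2 = add_gaussian_noise(clean, 10.0, rng)

# Check the achieved SNR from the noise that was actually added.
noise = [y - x for x, y in zip(clean, noisy)]
measured = sum(v * v for v in noise) / len(noise)
achieved_snr_db = 10.0 * math.log10(0.5 / measured)   # 0.5 is the power of a unit sine
print(sigma2, achieved_snr_db)
```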

FIG. 15. Results for TIMIT utterance “Where were you while we were away?” (sx9) for male
(left column) and female (right column) speakers. Top plots (a), (d). Spectrograms. Middle
plots (b), (e). Capturegrams. Bottom plots (c), (f). Thresholded histogram plots. At low
frequencies, all individual harmonics are tracked, whereas above 1000 Hz, only prominent
formant harmonics are tracked.

FIG. 16. Results for TIMIT utterance “The oasis was a mirage” (sx280) for male (left
column) and female (right column) speakers. Plots as in the previous figure. High frequency
frication above 4000 Hz in “oasis” not shown.

FIG. 17. Results for two TIMIT utterances in noise at 10 dB SNR. “The oasis was a mirage”
(sx280) for a female speaker (left column) and “Where were you while we were away?” (sx9) for a
male speaker (right column). Plots as in the previous figure.

V. DISCUSSION


Our interest in synchrony-capture based filterbanks has been motivated by considerations
of the functional anatomy and response characteristics of the cochlea, adaptive filtering
signal processing strategies in radar and other artificial systems, and the possible role of
synchrony capture in auditory nerve representation of complex sounds. The primary goal in
this first stage of investigation has been to integrate these aspects into a workable algorithm
for tracking the major frequency components present in an acoustic signal.

A.  Relationship  to  previous  signal  processing  strategies 

As  is  often  the  case,  the  signal  processing  constituents  of  the  SCFB  algorithm  proposed 
here  have  a  long  history.  Frequency  discriminator  loops  (FDLs)  have  been  used  in  digital 
and analog communication systems for signal tracking for many decades27. The frequency
error  detector  (FED)  circuit  (Figure  4)  is  a  key  component  of  the  FDL  that  senses  the 
difference  between  the  frequency  of  the  input  signal  and  that  of  a  local  VCO  in  order  to 
produce  a  proportional  error  voltage  that  can  be  used  for  steering  purposes. 

Basically  there  are  two  or  three  common  types  of  frequency  error  detector  circuits  that 
are  used  in  practice.  The  quadricorrelator28,29,  briefly  outlined  in  Appendix  A,  is  often  used 
in  communication  systems.  The  other  type,  which  has  been  used  here  in  the  SCFB  design, 
uses  stagger-tuned  filters  and  compares  envelopes  of  filter  outputs  to  derive  running  error 
voltages.  Ferguson  and  Mantey21  originally  proposed  the  use  of  such  adaptable  stagger-tuned 
bandpass filters for frequency error detection. Alternatively, frequency error detectors can also
be  implemented  directly  by  using  phase  derivatives  of  a  complex  signal  (see  for  example36,37). 
Wang38  has  designed  a  harmonic  locked  loop  to  track  the  fundamental  frequency  of  a  periodic 
signal  using  this  idea.  However,  these  approaches  require  a  complex  (Hilbert-transformed) 
signal  for  processing. 

In  their  adaptive,  stagger-tuned  design,  Ferguson  and  Mantey  used  the  error  voltage 
(envelope  difference)  to  retune  the  bandpass  filters  directly  by  moving  their  pole  locations. 
Such a design does not use VCOs to tune the filters. Based on this idea one could imagine
cochlear  filters  where  the  frequency  response  of  a  filter  is  adjusted  by  changing  a  mechanical 
parameter  such  as  stiffness  depending  on  the  envelope  voltage  difference  between  the  left 
and  the  right  filters.  Costas22  used  a  similar  FED,  but  used  the  error  voltage  to  change 
the  frequency  of  a  VCO  that  indirectly  moved  the  left  and  the  right  bandpass  filters  in 
tandem.  The  proposed  approach  is  closer  to  Costas’  method  and  its  variants22,36,38.  The 
main  difference  is  that  a  compressive  (logarithmic)  nonlinearity  is  used  on  the  envelope  of  a 
signal  to  suppress  nearby  weaker  signal  components.  Such  compressive  nonlinearities  have 
the  property  of  favoring  a  stronger  component  in  the  presence  of  other  weaker  ones.  This  is 
the  primary  reason  that  synchrony  capture  occurs. 

The SCFB design is also related to adaptive formant tracking methods proposed earlier
by Rao and Kumaresan39,40 and subsequently improved by Mustafa and Bruce41. However,
in the Rao-Kumaresan approach the adaptive formant filters were controlled by measuring the
instantaneous frequency of a complex-valued signal. Further, as mentioned earlier, the EIH and
ZCPA algorithms also estimate the frequency of tonal signals based on the zero or level
crossing intervals. However, these may be regarded as open loop methods for estimating
instantaneous frequencies, unlike closed loop methods such as the FDL.


B.  Similarities  to  response  characteristics  of  the  cochlea  and  auditory  nerve 

Although  the  SCFB  is  not  a  biophysical  model,  its  signal  processing  behavior  bears 
many  qualitative  similarities  to  response  patterns  in  the  mammalian  cochlea.  First,  the 
mammalian  cochlea  produces  acoustic  emissions,  called  spontaneous  otoacoustic  emissions 
(SPOAEs)42. The narrow spectral widths of these emissions suggest that they are generated
by  spontaneous  oscillations  in  the  cochlea,  possibly  in  outer  hair  cells.  This  kind  of  behavior 
is  also  characteristic  of  voltage  controlled  oscillators  that  implement  the  FDL  in  the  present 
architecture. 

Second, it is well known42 (p. 117) that the cochlea also produces acoustic emissions
at additional frequencies when two tones of frequency $f_1$ and $f_2$ ($f_2 > f_1$) are presented.
Listeners can often hear discordant faint tones not present in the original stimulus. The
strongest of these cochlear distortion products, the cubic distortion product generated at
$2f_1 - f_2$ Hz, is thought to be a direct byproduct of cochlear mechanics, in the form of a
compressive nonlinearity in OHC response. The ensuing signal distortions are analogous
to intermodulation products in communication systems. The FDL architecture produces
similar combination tones as a byproduct of its operation. Consider the operation of the
FDL as described in Section II.B when two simultaneous tones with frequencies $f_1$ and $f_2$
and corresponding amplitudes $A_1$ and $A_2$ are applied as input. The spectrum of the VCO
output for this stimulus is shown in Figure 18 for a channel with center frequency 1890
Hz, with $f_1 = 1950$ Hz, $f_2 = 2050$ Hz, $A_1 = 1$ and $A_2 = 0.5$. Note that the VCO locks on
to the stronger tone at $f_1$ Hz and that the left and the right filters of that channel adjust
themselves such that their average envelopes are equal. The resulting error signal $e(t)$
is then proportional to $C \cos(\Delta\omega t)$, where $\Delta\omega = 2\pi (f_2 - f_1)$ and $C$ is a
constant related to the ratio of amplitudes $A_2/A_1$ (see Eq. 14). This error signal then
frequency modulates the VCO’s carrier at the dominant tone frequency $f_1$. The resulting
frequency modulated VCO output has sideband components at $f_1 \pm n(f_2 - f_1)$ (Ref. 18,
pp. 180-87). The lower sideband for $n = 1$ falls at $2f_1 - f_2$, the frequency of the cubic
distortion product. The output spectrum in Figure 18 shows some of the sidebands (for $n = 1$
and 2). Thus qualitative parallels exist between combination tones produced by live cochleae
and the VCO-driven frequency capture circuits of the filterbank.
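
The sideband structure can be checked numerically with a toy model of a locked VCO. The sketch below does not simulate the FDL itself; it simply frequency modulates a carrier at 1950 Hz with a 100 Hz cosine and measures the resulting spectral lines. The modulation index `beta = 0.5` is an arbitrary illustrative value (in the loop it would depend on the loop gain and on the amplitude ratio), and the helper `tone_amplitude` is hypothetical.

```python
import cmath
import math

def tone_amplitude(x, freq_hz, fs_hz):
    """Amplitude of the sinusoidal component of x at freq_hz (single-bin DFT)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs_hz) for n, v in enumerate(x))
    return 2.0 * abs(acc) / len(x)

fs = 16000.0
f1, f2 = 1950.0, 2050.0    # dominant and weaker tone frequencies
df = f2 - f1
beta = 0.5                  # modulation index: illustrative assumption
n_samples = 1600            # 0.1 s, so every multiple of 10 Hz is an exact DFT bin

# VCO locked to f1 and frequency modulated by a term proportional to cos(2*pi*df*t);
# integrating that frequency deviation gives the phase term beta * sin(2*pi*df*t).
vco = [math.cos(2.0 * math.pi * f1 * n / fs + beta * math.sin(2.0 * math.pi * df * n / fs))
       for n in range(n_samples)]

sidebands = {k: tone_amplitude(vco, f1 + k * df, fs) for k in range(-2, 3)}
for k in sorted(sidebands):
    print(f1 + k * df, round(sidebands[k], 4))
```

The lines appear only at 1750, 1850, 1950, 2050 and 2150 Hz (and at further multiples of 100 Hz away from the carrier), with amplitudes that fall off rapidly with sideband order, as expected for narrowband frequency modulation.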

Two-tone suppression is a third nonlinear phenomenon. Like the cochlea, the proposed
filterbank produces both rate and synchrony suppression. Two-tone rate suppression is
generally regarded as a nonlinear property of the cochlea in which the average neural firing
rate in the region most sensitive to a probe tone is reduced by the addition of a suppressor
tone at a different nearby frequency. For the filterbank, when dominant frequency components
steer the tunings of local VCOs away from other frequencies, responses to less intense
secondary tones at those frequencies are attenuated relative to those produced when the
dominant tone is absent.

FIG. 18. Distortion products. Spectrum of the VCO output signal of a channel with center
frequency of 1890 Hz in response to two pure tones at frequencies $f_1 = 1950$ Hz and
$f_2 = 2050$ Hz with amplitudes $A_1 = 1$ and $A_2 = 0.5$, respectively (horizontal axis: frequency
in Hz). Note occurrences of distortion products at frequencies $f_1 \pm n(f_2 - f_1)$. These are
generated in frequency discriminator loops when VCOs lock on to dominant tones at $f_1$ but are
also frequency modulated by an error signal consisting of a weak tone at $\Delta f = f_2 - f_1$.

There is also the related phenomenon of synchrony suppression. The effects of two tonal
inputs on temporal patterns of neural firing have been extensively studied. Auditory nerve
fibers phase-lock in response to low frequency tones (< 5000 Hz), i.e., spikes are mainly
produced at particular phase angles of the waveform11. The degree of synchronization of spikes
to a given frequency can be quantified by computing the vector strength (“synchronization
index”) of the spike distribution as a function of waveform phase. When the stimulus
consists of two tones, Hind et al.43 found that auditory nerve spikes may be phase locked to one
tone, or to the other, or to both tones simultaneously. Which of these occurs is determined
by the relative intensities of the two tones and their frequencies and spacings. Moore11
summarizes these results as follows: “When phase locking occurs to only one tone of a pair,
each of which is effective when acting alone, the temporal structure of the response may
be indistinguishable from that which occurs when the tone is presented alone. Further, the
discharge rate may be similar to the value produced by that tone alone. Thus the dominant
tone appears to ‘capture’ the response of the neuron. This (synchrony) capture effect
underlies the masking of one sound by another.” The tone that is suppressed ceases to
contribute to the pattern of phase-locking, and the neuron responds as if only the suppressing
tone were present. The effect is that the synchronization index of a fiber to a given tone
is reduced by the application of a second tone44. Similarly, in the filterbank, capture of a
given channel VCO by a locally dominant component produces an output waveform having
the frequency of the dominant tone, causing the vector strength of the dominant component
to increase at the expense of those of weaker secondary ones.
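
The vector strength mentioned above has a compact definition: each spike is mapped to a unit phasor at its stimulus phase, and the vector strength is the length of the mean phasor. A minimal sketch follows, using idealized spike times (one spike per cycle of a 440 Hz tone, at a fixed phase) that are invented purely for illustration.

```python
import cmath
import math

def vector_strength(spike_times, freq_hz):
    """Length of the mean unit phasor of the spike phases relative to freq_hz.

    Returns a value between 0 (no phase locking) and 1 (perfect phase locking).
    """
    phasors = [cmath.exp(2j * math.pi * freq_hz * t) for t in spike_times]
    return abs(sum(phasors)) / len(phasors)

# Idealized "captured" fiber: one spike per cycle of the 440 Hz tone (assumed data).
spikes = [k / 440.0 + 0.0005 for k in range(200)]

vs_440 = vector_strength(spikes, 440.0)   # synchrony to the capturing tone
vs_587 = vector_strength(spikes, 587.0)   # synchrony to the suppressed tone
print(round(vs_440, 3), round(vs_587, 3))
```

For a captured fiber the index is high at the dominant frequency and close to zero at the suppressed one; the same measure can be applied to threshold crossings of a filterbank channel output.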


VI.  CONCLUSIONS 

A  striking  feature  of  the  phase-locked  responses  to  complex  sounds  is  the  phenomenon 
of  “synchrony  capture”3,5,  wherein  an  intense  stimulus  frequency  component  dominates  the 
temporal  firing  patterns  of  auditory  nerve  fibers  innervating  the  corresponding  cochlear 
frequency  region.  The  capture  effect  refers  to  the  almost  exclusive  nature  of  the  phase-locking 
to the dominant component, such that whole subpopulations of auditory nerve
fibers in a cochlear region respond in the same way. Synchrony capture may be critical for
separation  of  concurrent  harmonic  sounds. 

An adaptive filterbank structure is proposed that emulates synchrony capture in the
auditory nerve. This filterbank has two parts: a fixed array of traditional, passive linear
(gammatone or equivalent) filters, cascaded with a bank of adaptively tunable bandpass
filter triplets. Envelope differences in the outputs of the filters that form the triplets
are used in a frequency discriminator loop (FDL) to steer their center frequencies with the
help of a voltage controlled oscillator (VCO).

The resulting filterbank exhibits many desirable properties for processing speech and
other natural sounds. First, the number of channels converging on a particular frequency
yields a robust means of encoding the intensity of the driving frequency component. Second,
the VCOs track resolved harmonics, which are known to be essential in determining pitch
and for the separation of concurrent periodic sounds. Third, for voiced speech, the VCOs track
the strongest harmonic in each formant region, yielding precise features for formant tracking.

VIII. APPENDIX A: ALTERNATE FREQUENCY ERROR DETECTORS


The frequency error detector (FED) is a key component of the FDL (see Figure 4). In the
tone followers described in Section II we used the difference in (squared) envelopes (or
log-envelopes) of the outputs of $H_R(\omega)$ and $H_L(\omega)$ as the error signal $e(t)$. $e(t)$ is
proportional to the difference between the VCO frequency $\omega_c$ and the input (or dominant)
tone frequency $\omega_1$. In Section II this specific type of FED (that is, one that uses squared
envelope differences) was chosen because of its apparent functional similarity to cochlear hair
cells. (The inner/outer hair cells act as half-wave rectifiers followed by low-pass filters.)
Disregarding such constraints, if computer implementation of an FDL is the primary goal,
then many other FEDs are available. Of course, the frequency error signal can be positive
or negative depending on whether $\omega_c$ is greater or smaller than $\omega_1$. Therefore, any method
that is used to measure the frequency of a single tone can serve as an FED as long as
it is also capable of detecting the sign of the frequency error. One such FED is called
a quadricorrelator28. The quadricorrelator (refer to Figure 3 in Ref. 28) takes as input a tone
$A_1 \cos(\omega_1 t + \theta_1)$ and the VCO outputs $\cos(\omega_c t)$ and $\sin(\omega_c t)$. The low-pass
filters (LPFs) (in Figure 3 in Ref. 28) retain only the difference frequency outputs
$\cos(\Delta\omega t + \theta_1)$ and $\sin(\Delta\omega t + \theta_1)$. The two differentiator outputs, after
cross multiplying (in Figure 3 in Ref. 28), are added together to produce the error signal,
which retains the sign of the frequency error. Since in-phase and quadrature-phase signals
($I$ and $Q$) are available in our simulations, complex-valued processing can also be used to
estimate frequency error37,38,45.
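
The sign-preserving property of the quadricorrelator can be illustrated with a short numerical sketch. It is not a reproduction of the circuit in Ref. 28. The ideal low-pass filters are replaced by writing the difference-frequency terms directly, the differentiators are simple backward differences, and the two cross products are combined with the sign convention I dQ/dt - Q dI/dt so that they reinforce. All of these choices, and the function name, are assumptions of this example.

```python
import math

def quadricorrelator_error(f_in, f_vco, fs=16000.0, n=400, theta=0.7):
    """Mean error output (rad/s) for a unit-amplitude tone at f_in and a VCO at f_vco."""
    dw = 2.0 * math.pi * (f_in - f_vco)
    # Idealized low-pass outputs: only the difference-frequency terms survive.
    i_sig = [math.cos(dw * k / fs + theta) for k in range(n)]
    q_sig = [math.sin(dw * k / fs + theta) for k in range(n)]
    total = 0.0
    for k in range(1, n):
        d_i = (i_sig[k] - i_sig[k - 1]) * fs          # differentiator, I branch
        d_q = (q_sig[k] - q_sig[k - 1]) * fs          # differentiator, Q branch
        total += i_sig[k] * d_q - q_sig[k] * d_i      # cross multiply and combine
    return total / (n - 1)

e_above = quadricorrelator_error(950.0, 901.0)   # tone above the VCO frequency
e_below = quadricorrelator_error(850.0, 901.0)   # tone below the VCO frequency
e_locked = quadricorrelator_error(901.0, 901.0)  # tone at the VCO frequency
print(e_above, e_below, e_locked)
```

The output is positive when the tone lies above the VCO frequency, negative when it lies below, and zero at lock, with magnitude close to the frequency difference in rad/s. That is the behavior needed to steer a VCO.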
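
IX. APPENDIX B: EXPRESSIONS FOR THE FREQUENCY DISCRIMINATOR CONSTANT $k_s$


$k_s$, defined in Section II.A, is the slope of the frequency discriminator function $S(\omega)$ at
$\omega_c$. $S(\omega)$ for the Simple Tone Follower (STF) is defined as

$$S(\omega) = \frac{|H_R(\omega)|^2 - |H_L(\omega)|^2}{|H_R(\omega)|^2 + |H_L(\omega)|^2}, \qquad (27)$$

where $|H_R(\omega)|^2 = |H(\omega - (\omega_c + \Delta))|^2$ and $|H_L(\omega)|^2 = |H(\omega - (\omega_c - \Delta))|^2$.
Using $H(s) = \frac{1}{s + a}$, $H(\omega) = \frac{1}{j\omega + a}$, $|H_R(\omega)|^2$ and $|H_L(\omega)|^2$ are

$$|H_R(\omega)|^2 = \frac{1}{(\omega - (\omega_c + \Delta))^2 + a^2}, \qquad (28)$$

$$|H_L(\omega)|^2 = \frac{1}{(\omega - (\omega_c - \Delta))^2 + a^2}. \qquad (29)$$

Substituting Eqs. 28 and 29 in Eq. 27, we get

$$S(\omega) = \frac{2\Delta(\omega - \omega_c)}{\omega^2 + \omega_c^2 + \Delta^2 - 2\omega\omega_c + a^2}. \qquad (30)$$

$k_s$ is obtained by taking the derivative of $S(\omega)$ with respect to $\omega$ and evaluating at $\omega = \omega_c$:

$$k_s = \left. \frac{dS(\omega)}{d\omega} \right|_{\omega = \omega_c} = \frac{2\Delta}{\Delta^2 + a^2}. \qquad (31)$$

Similarly, for the Dominant Tone Follower (DTF), $k_s$ is obtained by taking the derivative of
$S(\omega) = \log \frac{|H_R(\omega)|^2}{|H_L(\omega)|^2}$ and evaluating at $\omega = \omega_c$. It is easy to show that

$$k_s = \frac{4\Delta}{\Delta^2 + a^2}. \qquad (32)$$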
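
As a sanity check, the slopes of the two discriminator functions at $\omega_c$, which work out to $2\Delta/(\Delta^2 + a^2)$ for the STF and $4\Delta/(\Delta^2 + a^2)$ for the DTF (natural logarithm), can be compared against numerical derivatives. The sketch assumes a first-order prototype with unit numerator, whose gain cancels in both ratios. The parameter values are illustrative (a 901 Hz center frequency, $\Delta$ equal to 42 Hz in rad/s, and $a = \Delta$) and are not design values taken from the text.

```python
import math

def h_sq(w, center, a):
    """|H(w - center)|^2 for the first-order prototype H(s) = 1 / (s + a)."""
    return 1.0 / ((w - center) ** 2 + a ** 2)

def s_stf(w, wc, delta, a):
    """Normalized envelope difference used by the simple tone follower."""
    hr, hl = h_sq(w, wc + delta, a), h_sq(w, wc - delta, a)
    return (hr - hl) / (hr + hl)

def s_dtf(w, wc, delta, a):
    """Log envelope ratio used by the dominant tone follower (natural log)."""
    return math.log(h_sq(w, wc + delta, a) / h_sq(w, wc - delta, a))

def slope_at_center(func, wc, delta, a, step=1e-2):
    """Central-difference estimate of dS/dw at w = wc."""
    return (func(wc + step, wc, delta, a) - func(wc - step, wc, delta, a)) / (2.0 * step)

wc = 2.0 * math.pi * 901.0       # illustrative center frequency (rad/s)
delta = 2.0 * math.pi * 42.0     # illustrative filter offset (rad/s)
a = delta                        # illustrative bandwidth parameter (assumption)

ks_stf_numeric = slope_at_center(s_stf, wc, delta, a)
ks_dtf_numeric = slope_at_center(s_dtf, wc, delta, a)
ks_stf_closed = 2.0 * delta / (delta ** 2 + a ** 2)
ks_dtf_closed = 4.0 * delta / (delta ** 2 + a ** 2)
print(ks_stf_numeric, ks_stf_closed)
print(ks_dtf_numeric, ks_dtf_closed)
```

With these definitions the DTF slope is exactly twice the STF slope, so for the same filters the log-ratio detector produces a larger error signal per unit of frequency offset near lock.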

FIG. 1 Two views of the representation of vowel-like sounds in the AN. (a) Peristimulus
time histograms for cat ANFs arranged by characteristic frequency in
response to the onset of a five-formant synthetic vowel (/da/), reprinted from
Secker-Walker and Searle (1990)4. (b) Distribution of synchronized rates in
ANFs in response to a standard vowel /da/ with three formants $F_1$, $F_2$, and
$F_3$. $F_0 = 100$ Hz. Reprinted from Sachs et al. (2002)5.


FIG. 2 Synchrony capture of adjacent partials for two frequency separations. The
two neurograms show all-order interspike interval distributions for individual
cat auditory nerve fibers as a function of CF in response to complex tone
dyads presented 100 times at 60 dB SPL. Each tone of the pair consisted of
equal amplitude harmonics 1-6. New analysis of dataset originally reported
in Tramo et al. (2001)6. (a) Responses to a tone dyad a musical minor
second apart (16:15, $\Delta F_0 = 6.6\%$). Vertical bars indicate CF regions where one
interspike interval pattern predominates. The CFs of the fibers
shown are: 153, 283, 309, 345, 350, 355, 369, 402, 402, 431, 451, 530, 588, 602,
631, 660, 724, and 732 Hz. Misordered interval patterns (single-asterisked
histograms) are likely due to small CF measurement errors. (b) Response
to a tone dyad a musical fourth apart (4:3, $\Delta F_0 = 33.3\%$). Three distinct
interspike interval patterns associated with individual partials (440, 587, and
880 Hz) are produced in different CF bands, with abrupt transitions between
response modes. One fiber shows locking to distortion product $2f_1 - f_2$ near
its CF (double-asterisked histogram, $2f_1 - f_2 = 293$ Hz, CF = 283 Hz). Fiber
CFs were 153, 283, 346, 350, 355, 369, 402, 402, 431, 451, 530, 588, 602, 631,
660, 662, 724, 732, and 732 Hz.

FIG. 3 Synchrony capture filterbank (SCFB). (a) The filterbank architecture consists
of K constant-Q gammatone filters whose logarithmically-spaced center
frequencies span the desired audible frequency range. Each filterbank channel
consists of a frequency discriminator loop (FDL) cascaded with one of
the K gammatone filters. The output of each channel, $y_c(t)$, is obtained from
its center filter. See Sections II and III for details. Frequency responses of
fixed and tunable filters in the SCFB. Bottom left panel (b) shows the frequency
responses of fixed gammatone filters (the black dots indicate that not
all filter responses are shown). Bottom right panel (c) shows the frequency
responses of the tunable bandpass filter (BPF) triplets that adapt to the incoming
signal. One BPF triplet is associated with each fixed filter, such that
coarse filtering by the fixed gammatone filters is followed by additional, finer
filtering by tunable filters. The nested arrays of fixed, coarse and adjustable,
fine filters are arranged in a manner similar to a vernier scale.


FIG. 4 A generic frequency discriminator loop (FDL). The error signal $e(t)$ is a
measure of the frequency difference between the input signal and the VCO.
See Figures 5 and 8 for details of specific frequency error detectors.


FIG. 5 Frequency error detector (FED) used in the simple tone follower (STF). Error
signal $e(t)$ is computed using the formula $\frac{e_R(t) - e_L(t)}{e_R(t) + e_L(t)}$. The envelopes $e_L(t)$,
$e_R(t)$, and $e_C(t)$ are obtained as $I^2 + Q^2$. The $I$ and $Q$ for the center filter
$H_C(\omega)$ are the outputs of the LPFs shown in (b). $H_L(\omega)$ and $H_R(\omega)$ have
the same structure but with oscillator frequencies at $\omega_c - \Delta$ and $\omega_c + \Delta$,
respectively. The discriminator transfer characteristic $S(\omega)$ (thick line) and
magnitude responses of the left and right filters (thin lines) are shown in (c).

FIG. 6 Convergence of a BPF triplet on an input tone at $\omega_1$. (a) Frequency responses
of BPF triplet filters in relation to an input tone. The input tone frequency
is $\omega_1 = 2\pi \times 950$ Hz. Initially the L, C, and R filters are centered at $\omega_c - \Delta =
2\pi \times 859$ Hz, $\omega_c = 2\pi \times 901$ Hz, and $\omega_c + \Delta = 2\pi \times 943$ Hz, respectively. Since
initially $\omega_1 > \omega_c$, the initial envelope output $e_R(t)$ is greater than $e_L(t)$, so the
normalized error $e(t)$ is positive. This positive value of $e(t)$ causes the VCO
frequency $\omega_c$ to increase until $\omega_c$ equals $\omega_1$. (b) Time course of envelopes
$e_L(t)$, $e_C(t)$, and $e_R(t)$. Note that the envelopes $e_R(t)$ and $e_L(t)$ become
equal after some settling time and that $e_C(t)$ reaches a higher plateau, where
$e_L(t) = e_R(t) = 0.5\,e_C(t)$. (c) VCO frequency track for the C filter.

FIG. 7 Linearized model of the frequency discriminator loop.


FIG. 8 Frequency error detector (FED) for the dominant tone follower (DTF). The
error signal $e(t)$ is computed using the formula $\log\left(\frac{e_R(t)}{e_L(t)}\right)$.

FIG. 9 Behavior of a DTF in response to two nearby tones of different amplitude.
(a) Frequency response of BPF triplet filters and the input tones (vertical
arrows): a dominant tone at $\omega_1 = 2\pi \times 950$ Hz plus a half-amplitude interfering
tone at $\omega_2 = 2\pi \times 1050$ Hz. (b) Track of the VCO frequency for the center
filter C. With minor fluctuations, the VCO tracks the stronger 950 Hz tone
in spite of the weaker 1050 Hz interferer.

FIG. 10 (a) Tunable cos-cos filter. (b) cos-sin filter. (c) Frequency responses $G_1(\omega)$
and $G_2(\omega)$ (without the scale factor $j$) are shown. (d) Frequency responses of
the right and left filters, $H_R(\omega)$ and $H_L(\omega)$, obtained as the sum and difference
of $G_1(\omega)$ and $G_2(\omega)$ (Figure 11). The filters $H_R(\omega)$ and $H_L(\omega)$ are
synthesized from a single prototype $H(\omega)$, and hence are perfectly matched
and symmetric about $\omega_c$. The frequency response of $H_C(\omega)$, not shown, is
centered around $\omega_c$. All filters are linear-phase filters.

FIG. 11 Implementation of the frequency error detector and the frequency
discriminator loop. The center filter $H_C(\omega)$ (not shown) is implemented using a cos-cos
filter structure with $H(\omega)$ sandwiched between the multipliers as in Figure 5b.

FIG. 12 A typical BPF triplet centered at 1980 Hz. The broader frequency response
corresponds to the gammatone filter centered around 1980 Hz.


FIG. 13 Filterbank responses to pairs of harmonic tones. Left: responses to a note
dyad separated by a minor second (ΔF0 = 6.6%, F0s = 440 & 469 Hz). Right:
responses to a note dyad separated by a perfect fourth (ΔF0 = 33.3%, F0s
= 440 & 587 Hz). Top plots (a), (d): frequency tracks of the VCOs
(capturegram). Middle plots (b), (e): half-wave rectified output waveforms of
channel center filters (analogous to a post-stimulus time neurogram).
Bottom plots (c), (f): channel autocorrelations (compare with the autocorrelation
neurograms of Figure 2).

FIG. 14 (a) Steps involved in the SCFB algorithm. The input speech signal $s(t)$ (after
preemphasis) is processed by the 200 gammatone filters and the associated
FDLs, and the frequency tracks are plotted as capturegrams. The VCO
frequency values and the associated envelopes are used to generate the frequency
histograms from which dominant frequency tracks are derived. Results for
ISOLET vowel /i/: (b) spectrogram, (c) capturegram, (d) thresholded
histogram plot.

FIG. 15 Results for TIMIT utterance “Where were you while we were away?” (sx9)
for male (left column) and female (right column) speakers. Top plots (a), (d):
spectrograms. Middle plots (b), (e): capturegrams. Bottom plots (c), (f):
thresholded histogram plots. At low frequencies, all individual harmonics
are tracked, whereas above 1000 Hz, only prominent formant harmonics are
tracked.


FIG. 16 Results for TIMIT utterance “The oasis was a mirage” (sx280) for male (left
column) and female (right column) speakers. Plots as in the previous figure.
High-frequency frication above 4000 Hz in “oasis” is not shown.

FIG. 17 Results for two TIMIT utterances in 10 dB noise: “The oasis was a mirage”
(sx280) for a female speaker (left column) and “Where were you while we were
away?” (sx9) for a male speaker (right column). Plots as in the previous figure.

FIG. 18 Distortion products. Spectrum of the VCO output signal of a channel with center
frequency of 1890 Hz in response to two pure tones at frequencies $f_1 = 1950$
Hz and $f_2 = 2050$ Hz with amplitudes $A_1 = 1$ and $A_2 = 0.5$, respectively. Note
the occurrence of distortion products at frequencies $f_1 \pm n(f_2 - f_1)$. These are
generated in frequency discriminator loops when VCOs lock onto dominant
tones at $f_1$ but are also frequency modulated by an error signal consisting
of a weak tone at $\Delta f = f_2 - f_1$.